How can I avoid line breaks that leave short words at the edge of a line?

Question

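There are two goals here:

don't break after a short word that immediately follows punctuation,
don't break before a short word that immediately precedes punctuation,

subject to the usual constraints of good line breaking.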

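A simple solution is to declare punctuation to be an especially good place to break (a negative penalty of sufficiently large magnitude). This makes TeX weigh a break at punctuation against its other line-breaking considerations (badness, demerits, other penalties), but it does not guarantee that the unwanted breaks never occur.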

Here is a before-and-after comparison to illustrate:

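As you can see,

in the first paragraph, the ", it" at the end of the third line has moved to the next line after the change.
in the second paragraph, the "el." at the start of the fourth line and the "at," at the start of the sixth line have moved to the previous line after the change.
the third paragraph is included to show that this trick is not a guarantee: the "it." at the start of the fourth line stays where it is, because there is simply no way to fit it into the previous line.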

This was achieved with:

\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

in the following document:

\documentclass{article}
\begin{document}
\frenchspacing % Makes it easier
\hsize=20em
\parskip=10pt

% First, three paragraphs with the default settings
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\pagebreak

% Now the same text, with the meanings of . and , changed.
\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

% Change it back
\catcode`.=12 \catcode`,=12
\pagebreak

% Same text again, to show that nothing's permanently changed.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\end{document}

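Notes:

I wouldn't be surprised if changing the meanings of . and , like this breaks something. (In fact, I was surprised that nothing went wrong in this example, until I realized that catcode changes do not apply to tokens that have already been read.)
You can tune the penalty: I used -200 just as an example, but any value from -1 to -9999 will have some effect. (In this example, the threshold at which all of these changes take effect seems to be -175, although one of the changes happens even at -100.) A penalty ≤ -10000 forces a line break, which is not what you want.
You can do the same for more punctuation characters (?!:;), or set different penalties for different punctuation characters.
Things are a bit harder with (the default) \nonfrenchspacing, where the space after punctuation is larger. It is probably doable, but coming up with these examples was a lot of work, so I did not pursue it. Left as an exercise :-)
With LuaTeX you can even change the line-breaking algorithm, which would be a cool way to guarantee that there are no short words at line edges (if that is what you need).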
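
The note about other punctuation characters can be sketched as follows. This is only an illustration that repeats the pattern used above: the split into "strong" and "weak" punctuation and the value -100 are assumptions made for the example, not tuned values, and the group is there so that the catcode changes end with \endgroup instead of being reset by hand.

```latex
\documentclass{article}
\begin{document}
\frenchspacing
\hsize=20em

\begingroup % keep the catcode changes local to this group
% Sentence-ending punctuation: strongly preferred break points.
\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`?=\active \def?{\char`?\penalty -200\relax}
\catcode`!=\active \def!{\char`!\penalty -200\relax}
% Weaker punctuation: mildly preferred break points (-100 is a guess).
\catcode`,=\active \def,{\char`,\penalty -100\relax}
\catcode`;=\active \def;{\char`;\penalty -100\relax}
\catcode`:=\active \def:{\char`:\penalty -100\relax}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit.

\endgroup
\end{document}
```

As with the original two lines, anything read inside the group that relies on these characters having their normal meaning (a dimension such as 1.5em, for instance) is likely to break, so the group should contain running text only.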

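Edit: I couldn't resist implementing the "guaranteed" solution in LuaTeX. This version should work with both \frenchspacing and \nonfrenchspacing. What it does is detect certain sequences and insert an infinite (10000) penalty to prevent a break: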

(punct, space, short_word, space) -> (punct, space, short_word, penalty, space)

and

(space, short_word, punct) -> (penalty, space, short_word, punct)

For the example above, this produces:

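Note the overfull box in the last paragraph: the constraints are very strict, but that is exactly what we asked for. (In any case, you will probably not get overfull boxes with wider and longer paragraphs, and you can fix them in the usual ways, such as rewriting or adding \emergencystretch.)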
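
For the \emergencystretch route, a minimal sketch looks like this. The value 2em is an arbitrary illustration, and whether it is enough to remove the overfull box in this particular example has not been checked.

```latex
\documentclass{article}
\begin{document}
\frenchspacing
\hsize=20em
% 2em is an arbitrary illustrative value, not a recommendation.
% TeX only uses this extra stretch in a final line-breaking pass,
% after the earlier passes have failed to find acceptable breaks.
\emergencystretch=2em

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.
\end{document}
```

The price is looser lines in the paragraphs that need the extra pass, which is usually less objectionable than text sticking out into the margin.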

The code that produced the above (and even the idea itself) may well have bugs, and may even crash your LuaTeX run, but here it is:

\documentclass{article}
\directlua{dofile("strict.lua")}
\begin{document}
\frenchspacing % Keeping same example as before
\hsize=20em
\parskip=10pt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.
\end{document}

where strict.lua is:

function is_punct(n)
   if node.type(n.id) ~= 'glyph' then return false end
   if n.char > 127 then return false end
   local c = string.char(n.char)
   if c == '.' or c =='?' or c == '!' or c == ':' or c == ';' or c == ',' then
      return true
   end
   return false
end

function no_punct_short_word_eol(head)
   -- Prevents having a line that ends like "<punctuation><space><short_word>"
   -- How we do this:
   --   (1) detect such short words (punct, space, short_word, space)
   --   (2) insert a penalty of 10000 between the short_word and the following space.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_punct -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a glue, and length is less than threshold, insert a penalty before the glue.
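   -- Locals, so that the two filters do not share state through globals.
   local state = 'default'
   local root = head
   local length = 0  -- number of glyphs seen in the current word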
   while head do
      if state == 'default' then
         if is_punct(head) then
            state = 'seen_punct'
         end
      elseif state == 'seen_punct' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
         else
            state = 'default'
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         elseif is_punct(head) then
            state = 'seen_punct'
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if node.type(head.id) == 'glue' and length <= 2 then
            -- Moment of truth
            local penalty = node.new('penalty')
            penalty.penalty = 10000
            root = node.insert_before(root, head, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_punct_short_word_eol, 'Prevent short words after punctuation at end of sentence')

function no_bol_short_word_punct(head)
   -- Prevents having a line that starts like "<short_word><punctuation>"
   -- How we do this:
   --   (1) detect such short words (space, short_word, punct)
   --   (2) insert a penalty of 10000 between the space and the following short_word.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a punct, and length is less than threshold, insert a penalty before the glue.
   -- Note that for this to work, we need to maintain a pointer to where we saw the glue.
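   -- Locals, so that the two filters do not share state through globals.
   local state = 'default'
   local root = head
   local length = 0  -- number of glyphs seen in the current word
   local before_space = nil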
   while head do
      if state == 'default' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if is_punct(head) and length <= 2 then
            -- Moment of truth
            local penalty = node.new('penalty')
            penalty.penalty = 10000
            root = node.insert_after(root, before_space, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         elseif node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_bol_short_word_punct, 'Prevent short words at beginning of sentence before punctuation')
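
For a handful of known trouble spots, much the same node sequences can be produced by hand, without LuaTeX, using ties. The sketch below is a manual counterpart of the two filters, not a replacement for them: in LaTeX, ~ puts a penalty of 10000 in front of the space, which matches both (short_word, penalty, space) and (penalty, space, short_word, punct). The sentences are taken from the example above.

```latex
\documentclass{article}
\begin{document}
\frenchspacing
\hsize=20em

% Pattern 1 (punct, space, short_word, space): tie the short word to
% the word after it, so that a line cannot end in ", it".
Donec erat elit, tincidunt non, it~vel, tincidunt vehicula velit.

% Pattern 2 (space, short_word, punct): tie the short word to the
% word before it, so that a line cannot start with "it.".
Ut gravida~it. Nam malesuada ante turpis eget.
\end{document}
```

The same caveat applies as for the Lua version: every tie removes a legal breakpoint, so overfull boxes become more likely in narrow columns.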

Answer 1