Как избежать переносов строк, приводящих к появлению коротких слов по краям строк?

Question

Здесь есть две цели:

не прерывайте после короткого слова, которое следует сразу за знаком препинания,
не прерывайте перед коротким словом, которое непосредственно предшествует знаку препинания,

при условии соблюдения регулярных ограничений по хорошему разрыву линии.

Одним из простых решений является объявление знаков препинания особенно хорошими местами для разрыва (отрицательный штраф, достаточно большой по величине). Это позволит TeX сбалансировать попытки разрыва на знаках препинания с другими соображениями разрыва строк (плохость, недостатки, другие штрафы), но не гарантирует, что разрывов такого рода не будет вообще.

Вот для наглядности:

Как вы видете,

В первом абзаце , itпосле изменения знак в конце третьей строки переместился на следующую строку.
Во втором абзаце el.в начале четвертой строки и at,в начале шестой строки после изменения переместились на предыдущую строку.
Третий абзац был включен, чтобы показать, что этот трюк не является гарантией: it.в начале четвертой строки остается символ «a», потому что его просто невозможно втиснуть в предыдущую строку.

Это было достигнуто с помощью:

\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

в следующем документе:

\documentclass{article}
\begin{document}
\frenchspacing % Makes it easier
\hsize=20em
\parskip=10pt

% First, three paragraphs with the default settings
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\pagebreak

% Now the same text, with the meanings of . and , changed.
\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

% Change it back
\catcode`.=12 \catcode`,=12
\pagebreak

% Same text again, to show that nothing's permanently changed.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\end{document}

Примечания:

Я не удивлюсь, если изменение значений .и ,таким образом что-то сломает. (На самом деле я был удивлен, что в этом примере ничего не испортилось, но потом понял, что изменения catcode не применяются к токенам, которые уже были считаны.)
Вы можете настроить штрафы: я использовал -200 просто в качестве примера, но все от -1 до -9999 будет иметь штрафы.некоторыйэффект. (В этом примере пороговое значение для вступления в силу всех этих изменений, по-видимому, составляет -175, хотя одно изменение происходит даже при -100.) Штраф ≤ -10000 приводит к разрыву строки, а это не то, что вам нужно.
Вы можете сделать то же самое для большего количества знаков препинания ( ?!:;) или установить разные штрафы для разных знаков препинания.
Немного сложнее с \nonfrenchspacing(по умолчанию), где пробелы больше после знаков препинания. Это может быть выполнимо, но придумать эти примеры было большой работой, поэтому я не стал этим заниматься. Оставлено в качестве упражнения :-)
С LuaTeX вы даже можете изменить алгоритм переноса строк, что было бы классным способомгарантияникаких коротких слов по краям строк (если это то, что вам нужно).

Редактировать: Я не мог устоять перед реализацией «гарантированного» решения в LuaTeX. Эта версия должна работать как с , так \frenchspacingи \nonfrenchspacing. Она обнаруживает определенные последовательности и вставляет бесконечные (10000) штрафы, чтобы предотвратить разрыв:

(punct, space, short_word, space) -> (punct, space, short_word, penalty, space)

и

(space, short_word, punct) -> (penalty, space, short_word, punct)

Для приведенного выше примера это дает:

Обратите внимание на переполненный блок в последнем абзаце, поскольку ограничения довольно строгие, но именно этого мы и просили. (В любом случае, у вас, вероятно, не будет переполненных блоков с более широкими и длинными абзацами, и вы можете исправить это обычными способами, переписав или добавив \emergencystretchи т. д.)

Код, который создал вышеприведенное (и даже идею), вполне возможно, содержит ошибки, которые могут даже привести к сбою вашей компиляции LuaTeX, но вот он:

\documentclass{article}
\directlua{dofile("strict.lua")}
\begin{document}
\frenchspacing % Keeping same example as before
\hsize=20em
\parskip=10pt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.
\end{document}

где strict.lua:

function is_punct(n)
   if node.type(n.id) ~= 'glyph' then return false end
   if n.char > 127 then return false end
   c = string.char(n.char)
   if c == '.' or c =='?' or c == '!' or c == ':' or c == ';' or c == ',' then
      return true
   end
   return false
end

function no_punct_short_word_eol(head)
   -- Prevents having a line that ends like "<punctuation><space><short_word>"
   -- How we do this:
   --   (1) detect such short words (punct, space, short_word, space)
   --   (2) insert a penalty of 10000 between the short_word and the following space.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_punct -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a glue, and length is less than threshold, insert a penalty before the glue.
   state = 'default'
   root = head
   while head do
      if state == 'default' then
         if is_punct(head) then
            state = 'seen_punct'
         end
      elseif state == 'seen_punct' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
         else
            state = 'default'
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         elseif is_punct(head) then
            state = 'seen_punct'
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if node.type(head.id) == 'glue' and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_before(root, head, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_punct_short_word_eol, 'Prevent short words after punctuation at end of sentence')

function no_bol_short_word_punct(head)
   -- Prevents having a line that starts like "<short_word><punctuation>"
   -- How we do this:
   --   (1) detect such short words (space, short_word, punct)
   --   (2) insert a penalty of 10000 between the space and the following short_word.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a punct, and length is less than threshold, insert a penalty before the glue.
   -- Note that for this to work, we need to maintain a pointer to where we saw the glue.
   state = 'default'
   root = head
   before_space = nil
   while head do
      if state == 'default' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if is_punct(head) and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_after(root, before_space, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         elseif node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_bol_short_word_punct, 'Prevent short words at beginning of sentence before punctuation')

Answer 1