¿Cómo evitar saltos de línea que dan lugar a palabras cortas en los bordes de las líneas?

Question

Hay dos objetivos aquí:

no se interrumpa después de una palabra corta que sigue inmediatamente a la puntuación,
no se detenga ante una palabra corta que precede inmediatamente a la puntuación,

sujeto a restricciones regulares de buen salto de línea.

Una solución simple es declarar la puntuación como lugares particularmente buenos para romper (una penalización negativa, de magnitud suficientemente grande). Esto permitirá a TeX compensar el intento de romper la puntuación con sus otras consideraciones de salto de línea (maldad, deméritos, otras penalizaciones), pero no garantizará que no haya absolutamente ninguna ruptura de ese tipo.

Aquí hay un antes y un después, para ilustrar:

Como se puede ver,

En el primer párrafo, , ital final de la tercera línea se ha movido a la siguiente línea después del cambio.
En el segundo párrafo, el.al comienzo de la cuarta línea y at,al comienzo de la sexta línea se han movido a la línea anterior después del cambio.
El tercer párrafo se ha incluido para mostrar que este truco no es una garantía: el it.principio de la cuarta línea permanece allí, porque simplemente no hay forma de encajarlo en la línea anterior.

Esto se logró con:

\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

en el siguiente documento:

\documentclass{article}
\begin{document}
\frenchspacing % Makes it easier
\hsize=20em
\parskip=10pt

% First, three paragraphs with the default settings
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\pagebreak

% Now the same text, with the meanings of . and , changed.
\catcode`.=\active \def.{\char`.\penalty -200\relax}
\catcode`,=\active \def,{\char`,\penalty -200\relax}

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

% Change it back
\catcode`.=12 \catcode`,=12
\pagebreak

% Same text again, to show that nothing's permanently changed.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.

\end{document}

Notas:

No me sorprendería que cambiar el significado de .y ,de esta manera rompiera algo. (De hecho, me sorprendió que nada se estropeara en este ejemplo, luego me di cuenta de que los cambios de catcode no se aplican a los tokens que ya se han leído).
Puedes modificar las penalizaciones: utilicé -200 solo como ejemplo, pero cualquier valor entre -1 y -9999 tendráalgunoefecto. (En este ejemplo, el umbral para que todos estos cambios surtan efecto parece ser -175, aunque un cambio ocurre incluso en -100). Una penalización ≤ -10000 fuerza un salto de línea, que no es lo que desea.
Puede hacer lo mismo con más caracteres de puntuación ( ?!:;) o tener penalizaciones diferentes para diferentes caracteres de puntuación.
Las cosas son un poco más difíciles con \nonfrenchspacing(el valor predeterminado), donde los espacios son más grandes después de la puntuación. Puede que sea factible, pero crear estos ejemplos requirió mucho trabajo, por lo que no lo he seguido. Dejado como ejercicio :-)
Con LuaTeX puedes incluso cambiar el algoritmo de salto de línea, lo que sería una forma genial degarantizarno hay palabras cortas en los bordes de las líneas (si eso es lo que necesita).

Editar: No pude resistirme a implementar la solución "garantizada" en LuaTeX. Esta versión debería funcionar con ambos \frenchspacingy \nonfrenchspacing. Lo que hace es detectar ciertas secuencias e insertar penalizaciones infinitas (10000) para evitar una ruptura:

(punct, space, short_word, space) -> (punct, space, short_word, penalty, space)

y

(space, short_word, punct) -> (penalty, space, short_word, punct)

Para el ejemplo anterior, esto produce:

Tenga en cuenta el cuadro demasiado lleno en el último párrafo porque las restricciones son bastante estrictas, pero eso es lo que pedimos. (De todos modos, probablemente no tendrá cuadros demasiado llenos con párrafos más anchos y más largos, y puede arreglarlos de la manera habitual, reescribiéndolos o agregando, \emergencystretchetc.).

El código que produjo lo anterior (e incluso la idea) posiblemente tenga errores que incluso pueden provocar que la compilación de LuaTeX falle, pero aquí está:

\documentclass{article}
\directlua{dofile("strict.lua")}
\begin{document}
\frenchspacing % Keeping same example as before
\hsize=20em
\parskip=10pt

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut blandit placerat justo, sed dictum sem. Donec erat elit, tincidunt non, it vel, tincidunt vehicula velit. Etiam pharetra ante at porta elementum. In nulla purus, faucibus non accumsan non, consequat eget.

Natis nulla blandit luctus tellus, sit amet posuere lacus maxius quis. In sit amet mattis est, a vehiula velit. Nam interum solicitudin el. In faucibus vulputate purus nec consectelur crass metus ipsum, blandit iln ullamcorpert at, portitor vita dolor. Duis sed mauris i inset inculis malesuada. Quisque laoret eu dui eget sage melittis corpum verborum.

Volutpat libero ac auctor. Donec semper, as id ultrices rhoncus, lectus nulla consequat nisi, ac sagitis risus lectus vel felis. Ut gravida it. Nam malesuada ante turpis eget. Ipsum factum verbum verdit.
\end{document}

dónde strict.luaes:

function is_punct(n)
   if node.type(n.id) ~= 'glyph' then return false end
   if n.char > 127 then return false end
   c = string.char(n.char)
   if c == '.' or c =='?' or c == '!' or c == ':' or c == ';' or c == ',' then
      return true
   end
   return false
end

function no_punct_short_word_eol(head)
   -- Prevents having a line that ends like "<punctuation><space><short_word>"
   -- How we do this:
   --   (1) detect such short words (punct, space, short_word, space)
   --   (2) insert a penalty of 10000 between the short_word and the following space.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_punct -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a glue, and length is less than threshold, insert a penalty before the glue.
   state = 'default'
   root = head
   while head do
      if state == 'default' then
         if is_punct(head) then
            state = 'seen_punct'
         end
      elseif state == 'seen_punct' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
         else
            state = 'default'
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         elseif is_punct(head) then
            state = 'seen_punct'
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if node.type(head.id) == 'glue' and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_before(root, head, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_punct_short_word_eol, 'Prevent short words after punctuation at end of sentence')

function no_bol_short_word_punct(head)
   -- Prevents having a line that starts like "<short_word><punctuation>"
   -- How we do this:
   --   (1) detect such short words (space, short_word, punct)
   --   (2) insert a penalty of 10000 between the space and the following short_word.
   -- More concretely:
   --   * A punctuation is one of .?!:;, which are the ones affected by \frenchspacing
   --   * A space is any glue node.
   --   * A short_word is a sequence of only glyph and kern nodes.
   -- So we maintain a state machine: default -> seen_space -> seen_word
   -- where in the last state we maintain length. If we're in seen_word state and we see
   -- a punct, and length is less than threshold, insert a penalty before the glue.
   -- Note that for this to work, we need to maintain a pointer to where we saw the glue.
   state = 'default'
   root = head
   before_space = nil
   while head do
      if state == 'default' then
         if node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         end
      elseif state == 'seen_space' then
         if node.type(head.id) == 'glyph' then
            state = 'seen_word'
            length = 1
         else
            state = 'default'
         end
      elseif state == 'seen_word' then
         if is_punct(head) and length <= 2 then
            -- Moment of truth
            penalty = node.new('penalty')
            penalty.penalty = 10000
            root, new = node.insert_after(root, before_space, penalty)
            -- TODO: Is 'head' invalidated now? Docs don't say anything...
            state = 'default'
         elseif node.type(head.id) == 'glyph' or node.type(head.id) == 'kern' then
            if node.type(head.id) == 'glyph' then length = length + 1 end
         elseif node.type(head.id) == 'glue' then
            state = 'seen_space'
            before_space = head.prev
         else
            state = 'default'
         end
      else
         assert(false, string.format('Impossible state %s', state))
      end
      head = head.next
   end
   return root
end
luatexbase.add_to_callback('pre_linebreak_filter', no_bol_short_word_punct, 'Prevent short words at beginning of sentence before punctuation')

Answer 1