Automatically inserting a non-breaking space after initials in proper names

Question 1

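I have added support for initials to the luavlna package. The package uses a LuaTeX node-processing callback to insert language-dependent non-breaking spaces after one-letter words and after initials.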

Example:

\documentclass{article}
\usepackage{fontspec}
\usepackage[czech, english]{babel}
\usepackage{luavlna}
\preventsinglelang{czech}
\begin{document}
  \preventsingledebugon
D. E. Knuth, Ch. Somebody. \selectlanguage{czech} A. Dvořák, 
name in horizontal box \hbox{Č. Zíbrt}, Ř. Jelen \preventsingleoff C. Někdo, 
\preventsingleon Ř. Jelen, Ch. Josef, CH. Thisworkstoo

\end{document}

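You can see that in Czech, Ch is treated as a single letter. If you do not want language-dependent processing, set a default language with \preventsinglelang{languagename}; the rules of that language are then used for the whole document.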

Processing can be suspended with \preventsingleoff and resumed with \preventsingleon.

luavlna is not on CTAN yet, because the language detection needs to be made more robust. If you want to use it, you can download it from GitHub and install it in your local TEXMFHOME directory.
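To illustrate the node-processing idea described above, here is a much simplified sketch for LuaLaTeX. It is not luavlna's actual code: the function names protect_initials, next_skip_kern, is_upper and is_letter are made up for this illustration, only ASCII capitals are recognised, and there is no language awareness (so Ch. is not handled). It registers a pre_linebreak_filter callback that inserts an infinite penalty in front of the glue following "capital letter + dot", which is the same effect a tie (~) has.

```latex
% Compile with lualatex. Simplified illustration, NOT luavlna's implementation.
\documentclass{article}
\usepackage{luacode}
\begin{luacode*}
local GLYPH = node.id("glyph")
local GLUE  = node.id("glue")
local KERN  = node.id("kern")

-- Next node, skipping any font kerns (e.g. between "P" and ".")
local function next_skip_kern(n)
  n = n.next
  while n and n.id == KERN do n = n.next end
  return n
end

-- ASCII capitals only, to keep the sketch short
local function is_upper(n)
  return n and n.id == GLYPH and n.char >= 65 and n.char <= 90
end

local function is_letter(n)
  return n and n.id == GLYPH and
    ((n.char >= 65 and n.char <= 90) or (n.char >= 97 and n.char <= 122))
end

-- Look for <capital not preceded by a letter> <dot> <glue>
-- and forbid a line break at that glue.
local function protect_initials(head)
  for n in node.traverse_id(GLYPH, head) do
    if is_upper(n) and not is_letter(n.prev) then
      local dot  = next_skip_kern(n)
      local glue = dot and dot.next
      if dot and dot.id == GLYPH and dot.char == 46
         and glue and glue.id == GLUE then
        local p = node.new("penalty")
        p.penalty = 10000
        head = node.insert_before(head, glue, p)
      end
    end
  end
  return head
end

luatexbase.add_to_callback("pre_linebreak_filter",
  protect_initials, "protect_initials")
\end{luacode*}
\begin{document}
D. E. Knuth and J. Z. A. Bush
\end{document}
```

Because the callback works on the node list just before line breaking, the source text stays untouched, and text inside an \hbox is not visited by this particular callback. A real package has to deal with many more cases than this sketch does.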

Question 2

Edited to catch more cases (and to examine the failure modes)

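I agree with David's recommendation not to make the space active. The approach I take here is therefore to make the dot active and, in case it turns out to interfere with something else, to allow the feature to be switched on (\initialsON) and off (\initialsOFF) as needed.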
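As a minimal sketch of the mechanism only (not the solution itself), the following shows a dot being made active and toggled. The names \dotsON and \dotsOFF and the asterisk marker are assumptions made up for this illustration; the real macros appear in the MWE further down.

```latex
\documentclass{article}
\let\svdot.              % keep a copy of the ordinary dot
\catcode`.=\active
% While the dot is active, print the saved dot plus a visible marker
\def.{\svdot\textsuperscript{*}}
\catcode`.=12            % back to normal for the rest of the preamble
\def\dotsON{\catcode`.=\active}
\def\dotsOFF{\catcode`.=12 }
\begin{document}
\dotsON  Marked: A. B.
\dotsOFF Plain: A. B.
\end{document}
```

The category code is consulted when a character is read from the file, so the switch must be executed before the affected text is tokenized. It therefore has no effect on text that has already been read as a macro argument.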

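Choosing an active dot comes with a severe limitation: the active dot occurs after the initial, so it is impossible to know for certain whether the character before the active dot was an initial. Nevertheless, some interesting progress toward the goal is still possible.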

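In my original solution (the path that follows \specdothelper), the detection scheme was triggered only for initials typed without spaces, such as J.W.Bush: when there was no space between the . and the X, the sequence .X was converted to .~X (where X stands for any capital letter). This compressed syntax may take some getting used to, or may be altogether unacceptable to many users.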

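With this latest revision (and my first-ever successful use of \futurelet, yay!), I can now also look for the syntax ". X.", in which a space follows the first dot and a dot follows the capital letter. If this sequence is found, it is converted to ".~X.". In addition, if and only if an initial was recently found, the sequence ". Xx" is assumed to be a last name and is converted to ".~Xx". Thus, in a sequence such as J. Z. A. Bush, all three spaces are caught and converted to hard spaces.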
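For readers unfamiliar with \futurelet, here is a minimal sketch of the lookahead idea in isolation. The macro names \peekdemo, \peekdemoact and \nexttok are hypothetical and are not part of the solution's code. \futurelet makes \nexttok an alias of the upcoming token without removing it from the input, so the decision can be made before that token is processed.

```latex
\documentclass{article}
% \peekdemo inspects the token that follows it without consuming it
\def\peekdemo{\futurelet\nexttok\peekdemoact}
\def\peekdemoact{%
  \ifx\nexttok.%
    [dot follows]%
  \else
    [no dot]%
  \fi}
\begin{document}
\peekdemo. and \peekdemo x
\end{document}
```

In both cases the inspected token (the dot, or the x) is still typeset after the bracketed message, because \futurelet only peeks at it.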

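However, because the dot is the active character, there is no way to know whether the character before the dot is an initial; this can only be inferred by looking ahead in the input stream. In the example G. Washington, the problem at the first dot is that no pattern of initials has yet been established, so the characters ". Wa" seen ahead could be the start of a sentence rather than a name following an initial. This important case is therefore missed.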

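In this edit I have cut back the later discussion of the core of the logic. In summary, I will only say that the challenging areas, which required special handling, were spaces, \pars, and repeated dots .. (because the dot is the active character). The new logic in this revision follows the macro \foundspace.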

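Where the compressed (space-free) syntax fails is shown in my MWE. One example is a last name that does not begin with a capital letter (for example, C.deLune). Also, if a URL contains a dot followed by a capital letter, a space will be inserted (though \initialsOFF should be used before setting non-standard text such as a URL).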

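When using the revised syntax, which keeps the spaces between initials in the LaTeX file, there are three known failure modes (one of them significant). The first, discussed at some length above, is a single initial followed by a last name. The second is a sentence that begins with an initial (though that is poor grammar anyway). The third is a sentence that ends with something that looks like initials, such as "U. S. A."; the first word of the next sentence is then taken to be a last name, and a hard space is inserted before it.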

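With the compressed syntax, the requirement to type initials without spaces may be altogether unacceptable to many users. With the normal expanded syntax, the inability to detect a lone initial significantly limits the usefulness of this approach.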

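In the MWE below, the hard space \HS is deliberately defined as a \rule so that it is visible. To use the code as intended, replace that definition with one that makes \HS a hard space (or a thin space), as in the commented-out alternatives.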

\documentclass{article}
\usepackage{ifnextok}
\def\HS{\rule{.66ex}{1ex}}% TO DEMONSTRATE WHERE ACTIVE
%\def\HS{\,}% FOR NARROW SPACE
%\def\HS{~}% FOR NORMAL HARD SPACE
\let\svdot.
\def\knowninit{F}
\makeatletter
\def\specdot{\svdot\IfNextToken\@sptoken{\foundspace}%
  {\gdef\knowninit{F}\specdothelper}}
\long\def\foundspace#1{\IfNextToken\@sptoken{ #1\gdef\knowninit{F}}%
  {\def\savefirst{#1}\lookatsecond}}
\def\lookatsecond{\futurelet\secondchar\processsecond}

\long\def\specdothelper#1{%
  \if\svdot#1%
    \svdot%
  \else%
    \ifx#1\par%
      \par%
    \else%
      \ifnum`#1>`@\ifnum`#1<`[\HS\fi\fi#1%
    \fi%
  \fi%
}
\makeatother

\catcode`.=\active
\def\processsecond{%
  \ifx\secondchar.%
    \HS\gdef\knowninit{T}%
  \else%
    \if T\knowninit%
      \ifnum\expandafter`\savefirst>`@\ifnum\expandafter`\savefirst<`[\HS\else%
        { }\fi\else{ }\fi%
    \else%
      { }%
    \fi%
    \gdef\knowninit{F}%
  \fi%
  \savefirst%
}
\def\initialsON{\catcode`.=\active\def.{\specdot}}
\def\initialsOFF{\catcode`.=12\let.\svdot}
\catcode`.=12
\parskip 1ex
\begin{document}
\footnotesize
\noindent ON\initialsON

Fully spaced initials J. Z. A. Bush being tested, and here we check double and single initials: 
J. Q. Adams and G. Washington.  A single initial cannot be discerned because the dot after 
the G cannot know if the prior letter is an initial and no other initials follow the dot.

U. S. A. is OK, since ``is'' is not capitalized.

We can be fooled by U. S. Olympic Team, in that it considers ``Olympic'' to be the last name.
Can also be fooled if sentence ends in the U. S. A. The new sentence starts with a hard-space, 
with ``The'' as the last name.  Leaving out the spaces will fix U.S.A.  If no spaces are wanted,
\initialsOFF U.S.A. \initialsON 
can be gotten by temporarily turning initials OFF.

Compressed or uncompressed C.deLune and  C. de Lune fail to insert a hard space, because 
``d'' is not a capital letter.

Unspaced combinations: 3.2, a.b, J.Z.A.Bush, and G.Washington being successfully tested
 here, with non-capital letters screened out.

Testing.. successive... dots is OK... Unless the sentence ends with odd number of dots, then 
a space immediately followed by a dotted initial... S. Segletes would never start a sentence 
with an initial.  It is poor grammar in the first place. 

\noindent\hrulefill\\
OFF\initialsOFF (This was the raw text being processed)

Fully spaced initials J. Z. A. Bush being tested, and here we check double and single initials: 
J. Q. Adams and G. Washington.  A single initial cannot be discerned because the dot after 
the G cannot know if the prior letter is an initial and no other initials follow the dot.

U. S. A. is OK, since ``is'' is not capitalized.

We can be fooled by U. S. Olympic Team, in that it considers ``Olympic'' to be the last name.
Can also be fooled if sentence ends in the U. S. A. The new sentence starts with a hard-space, 
with ``The'' as the last name.  Leaving out the spaces will fix U.S.A.  If no spaces are wanted,
U.S.A. 
can be gotten by temporarily turning initials OFF.

Compressed or uncompressed C.deLune and  C. de Lune fail to insert a hard space, because 
``d'' is not a capital letter.

Unspaced combinations: 3.2, a.b, J.Z.A.Bush, and G.Washington being successfully tested
 here, with non-capital letters screened out.

Testing.. successive... dots is OK... Unless the sentence ends with odd number of dots, then 
a space immediately followed by a dotted initial... S. Segletes would never start a sentence 
with an initial.  It is poor grammar in the first place. 
\end{document}

Answer