How can I convert LaTeX code to minimal HTML?

Question

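To be honest, I don't think that what you want to achieve is really useful. The additional HTML tags and attributes carry useful semantic information that can be used for CSS styling and similar purposes.

For example, take this code: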

<h3 class='sectionHead'><span class='titlemark'>1.1   </span> <a id='x2-20001.1'></a>Nam amet</h3>
<!-- l. 12 --><p class='noindent'>Adipiscing est leo convallis nunc interdum Lorem hendrerit Vestibulum amet.
</p>

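<h3 class='sectionHead'> means that this title was produced by the \section command. <span class='titlemark'> can be used for special formatting of the section number. <a id='x2-20001.1'></a> is the destination for links from \ref commands that point to this section, and for links from the TOC; if you remove this tag, cross-references stop working. <!-- l. 12 --> is the line number in the original TeX file; it helps with debugging, but I agree it is less useful than the other tags. <p class='noindent'> means that this paragraph was not indented in the original document.

HTML files are meant to be consumed by machines, which don't care about the extra information, so you gain nothing by removing the tags and you lose quite a lot.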

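That said, if you really want to remove all of this information, you can. There are two ways: change the generated tags with a TeX4ht configuration file, or remove the tags programmatically with a LuaXML DOM filter. You can also combine the two approaches, using the configuration file for the easier parts and a build file to remove the remaining elements that are hard to remove from the TeX side.

This particular example can be solved with a configuration file alone. Save the following code as mycfg.cfg: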

\Preamble{xhtml}
\def\blocktag#1{\ifvmode\IgnorePar\fi\EndP\HCode{#1}}
\Configure{chapter}{}{}{\blocktag{<h2>}\chaptername\ \TitleMark\HCode{<br />\Hnewline}}{\blocktag{</h2>}}
\Configure{section}{}{}{\blocktag{<h3>}\TitleMark}{\blocktag{</h3>}}
\Configure{subsection}{}{}{\blocktag{<h4>}\TitleMark}{\blocktag{</h4>}}
\Configure{subsubsection}{}{}{\blocktag{<h5>}\TitleMark}{\blocktag{</h5>}}
\ConfigureMark{chapter}{\thechapter}
\ConfigureMark{section}{\thesection\ }
\ConfigureMark{subsection}{\thesubsection\ }
% subsubsection doesn't need mark configuration, as it doesn't produce a number
% handle paragraphs
\Configure{HtmlPar}{\EndP\HCode{<p>}}{\EndP\HCode{<p>}}{\HCode{</p>}}{\HCode{</p>}}
\Configure{textbf}{\HCode{<b>}\NoFonts}{\EndNoFonts\HCode{</b>}}
\Configure{textit}{\HCode{<i>}\NoFonts}{\EndNoFonts\HCode{</i>}}
\Configure{emph}{\HCode{<em>}\NoFonts}{\EndNoFonts\HCode{</em>}}
% handle the <a> tag inside sections

\catcode`\:=11

\def\Title:Link#1#2{}
\def\EndTitle:Link#1{}
% uncomment the following lines to get correct cross-references
%\LinkCommand\SectionLink{span,\noexpand\:gobble,id}
%\def\Title:Link{\SectionLink}
%\def\EndTitle:Link#1{\EndSectionLink}
\catcode`\:=12


\begin{document}
\EndPreamble
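
To try the configuration, you need a document to compile. The original input file is not shown here, so the following is only a minimal sketch of what a suitable sample.tex might look like. The book class and the exact text are assumptions, chosen so that the document contains a chapter, a section, and a few paragraphs:

```latex
% sample.tex: hypothetical test input, not the file used to produce the output shown in this answer
\documentclass{book}
\begin{document}

\chapter{Lorem ipsum}
% a plain paragraph directly after the chapter title
Dolor sit amet consectetuer eros sit quis mauris pretium.

\section{Nam amet}
% two paragraphs, separated by a blank line
Adipiscing est leo convallis nunc interdum.

Facilisi Nulla ultrices malesuada orci nibh eget ac Aliquam eros ut.

\end{document}
```

Any document with \chapter, \section, and ordinary paragraphs should exercise the same configurations.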

To handle section titles, you need to provide two configuration commands for each sectioning type:

 \Configure{sectionname}{at start of section}{at end of section}{section title}{end section title}
 \ConfigureMark{sectionname}{code that prints section number}

So, to configure sections, use:

\Configure{section}{}{}{\blocktag{<h3>}\TitleMark}{\blocktag{</h3>}}
\ConfigureMark{section}{\thesection\ }

This removes all the unnecessary formatting that TeX4ht produces.

Next, fix the paragraphs:

\Configure{HtmlPar}{\EndP\HCode{<p>}}{\EndP\HCode{<p>}}{\HCode{</p>}}{\HCode{</p>}}

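This removes the comments with line numbers and the information about indentation. The \EndP command inserts the closing tag of the previous paragraph.

I also provided more suitable formatting for \textbf and similar commands: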

\Configure{textbf}{\HCode{<b>}\NoFonts}{\EndNoFonts\HCode{</b>}}

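The \NoFonts command prevents the insertion of font-markup tags (for example, a <span> carrying a font-specific class). These tags are inserted whenever a non-default font is used; \NoFonts suppresses them, and \EndNoFonts must be used to turn them back on. If you don't use the font information at all, you can disable it by adding the NoFonts option to the \Preamble command: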

 \Preamble{xhtml,NoFonts}

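The last part is the most controversial one. The <a> element in section titles is inserted by the \Title:Link command, which can be redefined to discard the link. Because the name contains a : character, you also need to change the \catcode of that character: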

\catcode`\:=11
\def\Title:Link#1#2{}
\def\EndTitle:Link#1{}
\catcode`\:=12
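
The \catcode change is needed only because : is not normally allowed in a control sequence name that is typed directly. As a sketch of an alternative, the same two macros can be defined by building their names with \csname, which is standard TeX. This variant is not taken from the TeX4ht documentation, and it assumes that : is an ordinary (non-active) character at this point in the configuration file:

```latex
% Build the names Title:Link and EndTitle:Link with \csname,
% so no \catcode change is needed.
% \expandafter makes TeX construct the name before \def sees it.
\expandafter\def\csname Title:Link\endcsname#1#2{}
\expandafter\def\csname EndTitle:Link\endcsname#1{}
```

Both forms define the same macros. The rest of this answer keeps the \catcode version.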

Compiling with mycfg.cfg produces the following result:

tex4ebook -c mycfg.cfg sample.tex

 <h2>Chapter 1<br /> 
Lorem ipsum</h2>
<p>   Dolor sit amet consectetuer eros sit quis mauris pretium. Phasellus penatibus
interdum dolor Ut nisl.
   </p>
   <h3>1.1 Nam amet</h3>
<p>   Adipiscing est leo convallis nunc interdum Lorem hendrerit Vestibulum
amet.
</p><p>   Facilisi Nulla ultrices malesuada orci nibh eget ac Aliquam eros ut.
</p><p>
   </p>
   <h3>1.2 Lorem gravida</h3>
<p>   Oorci sociis Nunc id hendrerit at ac amet Pellentesque. Eleifend risus orci sem
Sed ac.
</p><p>   A nec pellentesque Pellentesque Morbi fringilla accumsan et metus at
enim.
</p><p>   Eu felis Curabitur quis nibh tellus.
   </p>

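If you want cross-references and the table of contents to work correctly, I recommend using the following configuration for \Title:Link instead. Put it inside the \catcode block, in place of the two empty definitions, as in the commented-out lines of the configuration file: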

\LinkCommand\SectionLink{span,\noexpand\:gobble,id}
\def\Title:Link{\SectionLink}
\def\EndTitle:Link#1{\EndSectionLink}

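\LinkCommand defines a new command that uses the TeX4ht cross-referencing mechanism to produce links. Instead of an <a> element, this version produces a <span>; \noexpand\:gobble removes the possible outgoing link, and id keeps the link destination that points to the section.

With this change, the result is: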

  <h2 id='lorem-ipsum'>Chapter 1<br /> 
<span id='x2-10001'>Lorem ipsum</span></h2>
<p>   Dolor sit amet consectetuer eros sit quis mauris pretium. Phasellus penatibus
interdum dolor Ut nisl.
   </p>
   <h3 id='nam-amet'>1.1 <span id='x2-20001.1'>Nam amet</span></h3>
<p>   Adipiscing est leo convallis nunc interdum Lorem hendrerit Vestibulum
amet.
</p><p>   Facilisi Nulla ultrices malesuada orci nibh eget ac Aliquam eros ut.
</p><p>
   </p>
   <h3 id='lorem-gravida'>1.2 <span id='x2-30001.2'>Lorem gravida</span></h3>
<p>   Oorci sociis Nunc id hendrerit at ac amet Pellentesque. Eleifend risus orci sem
Sed ac.
</p><p>   A nec pellentesque Pellentesque Morbi fringilla accumsan et metus at
enim.
</p><p>   Eu felis Curabitur quis nibh tellus.
   </p>

Note that the section now looks like this:

  <h3 id='nam-amet'>1.1 <span id='x2-20001.1'>Nam amet</span></h3>

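<span id='x2-20001.1'>Nam amet</span> was added by the changed configuration, and id='nam-amet' was added by tex4ebook. The latter provides a stable link destination based on the section title rather than on the position of the section, which is more likely to change.

The paragraphs also contain spurious whitespace, which comes from whitespace in the DVI file. You can get rid of it with a DOM filter.

A simple DOM filter for this task can look like this: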

local domfilter = require "make4ht-domfilter"

local function remove_space(node, regex)
  -- remove whitespace only from the text nodes
  if node and node:is_text() then
    node._text = node._text:gsub(regex, "")
  end
end

local filter = domfilter {
  function(dom)
    -- loop over <p> elements
    for _, p in ipairs(dom:query_selector("p")) do
      -- remove <p> elements without text
      local children = p:get_children()
      if #children < 2 and p:get_text():match("^%s*$") then
        p:remove_node()
      else
        local first = children[1]
        local last  = children[#children]
        remove_space(first, "^%s+") -- remove whitespace at the beginning
        remove_space(last, "%s+$") -- remove whitespace at the end of paragraph
      end
    end
    return dom
  end
}

Make:match("html$", filter)

To load this build file (saved here as build.lua), pass it with the -e option:

$ tex4ebook -c mycfg.cfg -e build.lua sample.tex

This is the result:

   <h2 id='lorem-ipsum'>Chapter 1<br /> 
<span id='x2-10001'>Lorem ipsum</span></h2>
<p>Dolor sit amet consectetuer eros sit quis mauris pretium. Phasellus penatibus
interdum dolor Ut nisl.</p>
   <h3 id='nam-amet'>1.1 <span id='x2-20001.1'>Nam amet</span></h3>
<p>Adipiscing est leo convallis nunc interdum Lorem hendrerit Vestibulum
amet.</p><p>Facilisi Nulla ultrices malesuada orci nibh eget ac Aliquam eros ut.</p>
   <h3 id='lorem-gravida'>1.2 <span id='x2-30001.2'>Lorem gravida</span></h3>
<p>Oorci sociis Nunc id hendrerit at ac amet Pellentesque. Eleifend risus orci sem
Sed ac.</p><p>A nec pellentesque Pellentesque Morbi fringilla accumsan et metus at
enim.</p><p>Eu felis Curabitur quis nibh tellus.</p>

Answer 1