The generation of LaTeX is configurable. The mapping of each HTML tag to LaTeX commands can be specified. (This mapping can even be changed dynamically during the processing of the HTML file.) It is also possible to exclude certain parts of the HTML files from the generated LaTeX file, or to include LaTeX parts in HTML comment lines, which are ignored by HTML viewers. This makes it possible to maintain sources for both HTML and LaTeX in the same HTML files.
The program performs some checking of the HTML files in order to generate correct LaTeX output, but this checking is not guaranteed to conform to any HTML standard. In some places the checking is more relaxed, and in other places more restrictive, than HTML 2.0. So far, there is not much support for extensions beyond HTML 2.0.
The program does extensive checking of links between the different files. For this reason it can also be used as a link checking program, either by giving it a single HTML file and the option -c, or by renaming it to chkhtml. To also check all referenced pages in the local directory (and its sub-directories), the option -s should be used as well.
Links to excluded HTML files (and other URLs) can be reported either as footnotes or as a sorted bibliography in the LaTeX file.
Error messages are written to standard output. The program can also generate an extensive cross-reference file listing all the anchor tags.
The program can be used either to convert a single HTML
file into a LaTeX file, or to convert a collection of related HTML
files into a single LaTeX file. These two modes of operation
are described below.
Instead of adding the required LaTeX commands manually, it
is also possible to place them inside comments in the HTML
file. See below for a description of the
commands which are recognized by html2tex inside HTML
files.
This page can be taken as an example of this. Execute the
following command to get a LaTeX file of this page:
'html2tex html2tex.html'. After this
the file html2tex.tex
can be processed and made, for example, into a PDF file:
html2tex.pdf.
When html2tex is executed with a skeleton file on the
command line, a LaTeX file with the same name as the skeleton
file, but with the extension .tex added to it, will
be created.
A real life example of a skeleton file is
transcoop, which includes pages from the
original TransCoop pages,
which are gone now.
The LaTeX file
transcoop.tex was generated when the following
command was executed in the TransCoop home directory:
'html2tex transcoop'.
From this, the PostScript file transcoop.ps can
be produced with the help of latex and dvips.
They change the mapping of the tag-name HTML tag to the
given LaTeX formatting commands. The strings LaTeX-open
and LaTeX-close are put around the text that is marked by
the HTML tag. (The string in LaTeX-close is generated
at the proper place when the closing tag is optional in the
HTML syntax.) If the LaTeX command has
to include a double quote, use two double quotes in the string.
If a real newline (the '\n' character) has to be included,
use '\nl' instead. (There is no LaTeX command starting with
this sequence, but there are many starting with '\n'.)
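The two escaping rules above can be illustrated with a small Python sketch. It is not taken from html2tex itself; the function name decode_mapping_string is hypothetical, and the sketch assumes that these two rules are the only escapes in a mapping string.

```python
def decode_mapping_string(raw: str) -> str:
    """Decode the body of a quoted LaTeX-open or LaTeX-close string.

    Assumption: only the two rules from the text apply.
    """
    # Two double quotes stand for one literal double quote.
    text = raw.replace('""', '"')
    # \nl stands for a real newline; every other backslash sequence,
    # such as \newline or \item, is left untouched.
    return text.replace("\\nl", "\n")


print(repr(decode_mapping_string(r"\newline\nl")))  # '\\newline\n'
```

Because no LaTeX command starts with \nl, a plain textual replacement is enough here and no tokenizer is needed.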
The options select special kinds of translation.
The following options are possible:
The pseudo HTML tags (which cannot occur in the HTML files) L1
to L9 specify what LaTeX commands should be generated for
which section level. The definition of these pseudo-tags is changed
by the command %html -s style for setting the
document style.
The default settings are the ones given below, using the
format to be used in the input file:
Converting a single HTML file
If the program is executed with a single HTML file, a
LaTeX file will be generated. For example, the command
'html2tex test.html' will generate the file test.tex.
However, files generated in this manner are not complete LaTeX files.
To make them complete, some LaTeX commands have to be prefixed and
appended to the file. A LaTeX file starts with commands to
specify the document style, the title page, and such.
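As an illustration of that wrapping step, here is a minimal Python sketch. The helper name wrap_body, the article document class, and the particular preamble commands are assumptions chosen for the example; html2tex does not prescribe them.

```python
def wrap_body(body: str, title: str, doc_class: str = "article") -> str:
    """Surround generated LaTeX with a minimal preamble and closing line."""
    # Hypothetical preamble: adjust the class and title page to taste.
    preamble = (
        f"\\documentclass{{{doc_class}}}\n"
        f"\\title{{{title}}}\n"
        "\\begin{document}\n"
        "\\maketitle\n"
    )
    # Make sure the body ends with exactly one newline before the closing line.
    return preamble + body.rstrip("\n") + "\n\\end{document}\n"


if __name__ == "__main__":
    # Example: complete the output of 'html2tex test.html'.
    with open("test.tex", encoding="latin-1") as src:
        complete = wrap_body(src.read(), "Test page")
    with open("test-complete.tex", "w", encoding="latin-1") as dst:
        dst.write(complete)
```

The file names and the latin-1 encoding in the usage part are likewise only assumptions.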
Converting a collection of HTML files
To produce a single LaTeX file from a collection of linked
HTML files, a skeleton LaTeX file has to be provided.
In this skeleton there are commands embedded in comments
which specify which HTML files should be included at which
place.
The skeleton file
The skeleton input file should contain valid LaTeX commands.
In the file all lines starting with %html will be interpreted
as special lines by the conversion program. These are used to
indicate which HTML files should be included, and to set the
various options.
The following special commands are recognized by html2tex:
Causes the file fn.html to be included
as LaTeX at the given input line. The level
should be an integer that specifies the indentation depth of the headers.
A value of 1 indicates that the file should be included at the
level using \section (or \chapter for
the book document style).
Specifies the URL of the directory of the input file. This is
needed to detect whether URLs in the HTML files map to
local HTML files. This command should be given before any
HTML file is included as LaTeX.
Indicate the style that should be used. By default the
book document style is used. Currently, the following
values for style are supported:
The command causes the mapping of the H1 to H7
tags to be set correctly for the given document style.
This command should be given before all commands to include
HTML files as LaTeX.
Causes LaTeX bibitems to be generated for all excluded
HTML files (and other URLs), at the current location in the
skeleton file.
If this command is not given anywhere in the input file
(and the -b command line option is not given either), all
external URLs are given as footnotes.
Changes the mapping of the tag-name HTML tag to the
given LaTeX formatting commands. See below for
a complete description.
Indicates that the from-URL is a (symbolic) link to
to-URL. To be used when there are two (or more)
URLs for the same physical file. The given URLs should be
relative to the root-URL.
Displays a different URL than the one found in the HTML
files, for example if one wants an ftp URL instead
of an http URL, or if one wants to reference the
original source when one has a local mirror of certain
files found at external URLs.
Indicates that the URL should be ignored.
To be used when there are additional HTML pages (for navigation
purposes) that you do not want to be referenced in the document.
The given URL should be relative to the root-URL.
Setting various LaTeX generation options.
The various options are explained below.
Special commands in the HTML files
The following special commands (inside HTML comments) are recognized
in the HTML files:
The program recognizes comments inside a pair of double dashes (--),
in any of the HTML tags including <! >. It also recognizes
any text in a <! > tag not surrounded by double dashes
as comment, but not without generating a warning message for it.
Causes the latex-commands to be copied to the LaTeX output
file. Use '&amp;', '&lt;',
'&gt;', and '&dash;'
for the characters '&',
'<', '>', and '-' respectively.
Causes the HTML text and tags to be omitted from the generated
LaTeX files. These special commands are recognized as tags and
should be placed at the proper places with respect to the
other tags. They can be nested.
latex-on may be followed by additional commands, which
are copied into the LaTeX file just like the latex command
described above.
Changes the mapping of the tag-name HTML tag to the
given LaTeX formatting commands. Follows the same rules
as the special command '%html -d' in the input file,
except that '&amp;', '&lt;',
'&gt;', and '&dash;' should be used
for the characters '&',
'<', '>', and '-' respectively.
See below for a detailed description.
Causes the latex-commands to be copied to the LaTeX output
file, just like 'latex latex-commands',
but if it occurs inside a normal HTML tag, it replaces the
LaTeX output that would otherwise have been generated.
Causes the LaTeX generation option option-name
to be set to the value of option-value. The
various options are explained below.
The given format string is used for the generation of the next
reference. (This is an experimental feature which has not been
fully tested.)
With this command the document style to be used is specified.
By default the book document style is used. Currently, the following
values for document-style are supported:
The command causes the mapping of the H1 to H7
tags to be set correctly for the given document style.
This command should appear at the start of the HTML file, and should
appear at most once. It is only useful when
generating a LaTeX file from a single
HTML file.
This command specifies the place where the bibliography should
be included.
It causes LaTeX bibitems to be generated for all excluded HTML
files (and other URLs), at the current location in the HTML file.
This should appear at the end of the HTML file, and should
appear at most once. It is only useful when
generating a LaTeX file from a single
HTML file.
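LaTeX commands placed inside an HTML comment have to be escaped, since a literal '--' would end the comment early. The following Python sketch shows one way to do the escaping and its inverse. The function names are hypothetical, and the entity names (&amp;amp;, &amp;lt;, &amp;gt; and &amp;dash;, written here as they would appear in HTML source) are an assumption about what html2tex expects, so check them against your version before relying on this.

```python
# Order matters: '&' must be escaped first and unescaped last, otherwise
# the ampersands introduced by the other entities would be escaped again.
_ENTITIES = [("&", "&amp;"), ("<", "&lt;"), (">", "&gt;"), ("-", "&dash;")]


def escape_for_comment(latex: str) -> str:
    """Escape LaTeX text so it can be placed inside an HTML comment."""
    for char, entity in _ENTITIES:
        latex = latex.replace(char, entity)
    return latex


def unescape_from_comment(text: str) -> str:
    """Inverse of escape_for_comment."""
    for char, entity in reversed(_ENTITIES):
        text = text.replace(entity, char)
    return text


comment = "<!--latex " + escape_for_comment(r"\begin{tabular}{l|r} x & y \end{tabular}") + "-->"
```

The resulting comment string contains no raw '&', '<', '>' or '-' between the comment delimiters.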
Defining mappings
As described above, the mappings of HTML tags to LaTeX can
be changed both in the input file (with a line
of the form %html -d tag-name options "LaTeX-open"
"LaTeX-close")
and inside comments in the HTML files (in the form
latex-def tag-name options "LaTeX-open"
"LaTeX-close").
To be used for math mode. This mode assumes that everything
inside the tags is correct for the LaTeX math environment.
The contents are copied literally, except for # and %,
which are quoted.
To be used in combination with -math to ignore
the HTML tags for italics as LaTeX math mode uses italics
by default.
Causes the text inside the HTML tags to be excluded from
the generated LaTeX file. The LaTeX-open
and LaTeX-close strings are both written to the
LaTeX file (if not inside another tag with -off).
Causes the text inside the HTML tags to be included in
the generated LaTeX file. At the start of the file, generation
is switched off (one level). In the case of nested tags with -off,
-on cancels only one level. The LaTeX-open
and LaTeX-close strings are both written to the
LaTeX file (if not inside another tag with -off).
If both -on and -off are used (in an environment
with one level off), then only the LaTeX code for the tags is
generated.
To be used for the verbatim LaTeX environment. Ignores all
nested HTML tags that
would conflict with the LaTeX verbatim environment.
To be used for the alltt LaTeX environment, which
is like verbatim, but allows some additional formatting.
To be used for HTML tags that produce an error message
when generated on an empty line (like \newline).
To be used for HTML tags which do not allow section commands
inside their generated LaTeX output.
To be used to indicate to which section level a tag should be mapped
in LaTeX. The level at which the file is included is added.
If this option is used, then LaTeX-open and
LaTeX-close are ignored, except when the tag occurs in
an environment where a section heading cannot be generated.
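The interplay of -on and -off can be pictured as a counter of "off" levels. The Python sketch below models that reading; it is not html2tex's actual code. The function name, the event tuples, and the choice to restore the previous level when a tag closes are assumptions made for the illustration.

```python
def visible_text(events, option_of):
    """Return the text chunks that would reach the LaTeX output.

    events    -- list of ("open", tag), ("close", tag) or ("text", chunk)
    option_of -- maps a tag name to "-on" or "-off" (other tags: absent)
    """
    off_level = 1   # generation starts switched off by one level
    saved = []      # level to restore when the matching tag closes
    out = []
    for kind, value in events:
        if kind == "open":
            saved.append(off_level)
            option = option_of.get(value)
            if option == "-off":
                off_level += 1
            elif option == "-on" and off_level > 0:
                off_level -= 1   # -on cancels only one level
        elif kind == "close":
            off_level = saved.pop()
        elif kind == "text" and off_level == 0:
            out.append(value)
    return out


options = {"body": "-on", "script": "-off", "style": "-off"}
page = [
    ("open", "html"), ("open", "head"), ("open", "title"),
    ("text", "Title"), ("close", "title"), ("close", "head"),
    ("open", "body"), ("text", "A"),
    ("open", "script"), ("text", "code"), ("close", "script"),
    ("text", "B"), ("close", "body"), ("close", "html"),
]
print(visible_text(page, options))  # ['A', 'B']
```

With the default mappings, where body carries -on and script carries -off, only the body text outside the script element is kept in this model.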
%html -d html "" ""
%html -d head "" ""
%html -d title "" ""
%html -d body -on "" ""
%html -d address "" ""
%html -d h1 -l1 "{\LARGE \textbf{" "}}"
%html -d h2 -l2 "{\Large \textbf{" "}}"
%html -d h3 -l3 "{\large \textbf{" "}}"
%html -d h4 -l4 "\textbf{" "}"
%html -d h5 -l5 "{\small \textbf{" "}}"
%html -d h6 -l6 "{\footnotesize \textbf{" "}}"
%html -d p "\nl\nl" ""
%html -d ul -igh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d menu -igh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d dir -gnh "\nl\begin{itemize}" "\nl\end{itemize}\nl"
%html -d ol -igh "\nl\begin{enumerate}" "\nl\end{enumerate}\nl"
%html -d li "\nl\item " ""
%html -d lh "\nl\item " ""
%html -d dl -igh "\nl\begin{description}" "\nl\end{description}\nl"
%html -d dt "\nl\item[" "]"
%html -d dd "" ""
%html -d a "" ""
%html -d q "``" "''"
%html -d i -iim "\textit{" "}"
%html -d em "\emph{" "}"
%html -d b "\textbf{" "}"
%html -d strong "\textbf{" "}"
%html -d tt "\texttt{" "}"
%html -d samp "\texttt{" "}"
%html -d kbd "\texttt{" "}"
%html -d var "\textsl{" "}"
%html -d dfn "\textsc{" "}"
%html -d code "\texttt{" "}"
%html -d blink "" ""
%html -d cite "\emph{" "}"
%html -d blockquote -igh "\begin{quotation} " "\end{quotation}\nl"
%html -d bq -igh "\begin{quotation} " "\end{quotation}\nl"
%html -d u "\underbar{" "}"
%html -d pre -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d xmp -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d listing -verb "\begin{verbatim} " "\end{verbatim}\nl"
%html -d br -br "\newline\nl" ""
%html -d hr "\vspace{1mm}\hrule " ""
%html -d img "" ""
%html -d isindex "" ""
%html -d select "" ""
%html -d link "" ""
%html -d center "{\centering " "}"
%html -d meta "" ""
%html -d table "" ""
%html -d tr "" ""
%html -d td "" ""
%html -d sup "$^{" "}$"
%html -d sub "$_{" "}$"
%html -d caption "" ""
%html -d script -off "" ""
%html -d noscript "" ""
%html -d style -off "" ""
%html -d font "" ""
Suggested alternative settings for the various tags are:
%html -d title -on "\newpage\thispagestyle{myheadings}\markright{\sc{}" "}\pagenumbering{arabic}\nl\nl"
%html -d h1 -l1 "{\nl\nl\smallskip\LARGE\bf\noindent " "}\nl\nl\noindent{}"
%html -d h2 -l2 "{\nl\nl\smallskip\Large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h3 -l3 "{\nl\nl\smallskip\large\bf\noindent " "}\nl\nl\noindent{}"
%html -d h4 -l4 "{\nl\nl\smallskip\bf\noindent " "}\nl\nl\noindent{}"
%html -d h5 -l5 "{\nl\nl\smallskip\small\bf\noindent " "}\nl\nl\noindent{}"
%html -d h6 -l6 "{\nl\nl\smallskip\footnotesize\bf\noindent " "}\nl\nl\noindent{}"
%html -d code -math
%html -d blockquote "\nl{\parindent=2em\narrower\nl" "\nl}\nl"
The default settings for the pseudo tags for the book and report styles are:
%html -d l1 "\nl\nl\chapter{" "}\nl\nl"
%html -d l2 "\nl\nl\section{" "}\nl\nl"
%html -d l3 "\nl\nl\subsection{" "}\nl\nl"
%html -d l4 "\nl\nl\subsubsection{" "}\nl\nl"
%html -d l5 "\nl\nl\paragraph{" "}\nl"
%html -d l6 "\nl\nl\subparagraph{" "}\nl"
%html -d l7 "" ""
%html -d l8 "" ""
%html -d l9 "" ""
The default settings for the pseudo tags for the article style are:
%html -d l1 "\nl\nl\section{" "}\nl\nl"
%html -d l2 "\nl\nl\subsection{" "}\nl\nl"
%html -d l3 "\nl\nl\subsubsection{" "}\nl\nl"
%html -d l4 "\nl\nl\paragraph{" "}\nl"
%html -d l5 "\nl\nl\subparagraph{" "}\nl"
%html -d l6 "" ""
%html -d l7 "" ""
%html -d l8 "" ""
%html -d l9 "" ""
The default settings for the pseudo tags for the plain style are:
%html -d l1 "\nl\nl\section*{" "}\nl\nl"
%html -d l2 "\nl\nl\subsection*{" "}\nl\nl"
%html -d l3 "\nl\nl\subsubsection*{" "}\nl\nl"
%html -d l4 "\nl\nl\paragraph*{" "}\nl"
%html -d l5 "\nl\nl\subparagraph*{" "}\nl"
%html -d l6 "" ""
%html -d l7 "" ""
%html -d l8 "" ""
%html -d l9 "" ""
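The default tables above amount to a lookup from document style and level to a sectioning command. The Python sketch below restates them in that form. The function name is hypothetical, and the way the include level is combined with the heading level (heading level plus include level minus one, so that an include level of 1 leaves headings unchanged) is only one plausible reading of "the level at which the file is included is added".

```python
_BOOK = ["chapter", "section", "subsection", "subsubsection",
         "paragraph", "subparagraph"]
_ARTICLE = ["section", "subsection", "subsubsection",
            "paragraph", "subparagraph"]

# Default sectioning commands per document style, taken from the tables above.
SECTION_COMMANDS = {
    "book": _BOOK,
    "report": _BOOK,
    "article": _ARTICLE,
    "plain": [name + "*" for name in _ARTICLE],
}


def section_command(style: str, heading_level: int, include_level: int = 1) -> str:
    """Return the default LaTeX sectioning command, or "" if none is defined."""
    # Assumption: include level 1 maps h1 to pseudo-tag l1, h2 to l2, and so on.
    level = heading_level + include_level - 1
    names = SECTION_COMMANDS[style]
    if 1 <= level <= len(names):
        return "\\" + names[level - 1]
    return ""  # the remaining pseudo-tags default to empty strings


print(section_command("book", 1))        # \chapter
print(section_command("article", 2, 2))  # \subsubsection
```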
Options
The options can be used to configure the LaTeX fragments which
are generated by the program for the various kinds of references.
The options can be given in the input file (as a line
of the form %html -o option-name option-value),
and inside comments in the HTML files (in the form of
latex-opt option-name option-value).
Some options determine when references are generated. For example, an HTML file will often contain an HREF tag wherever an email address is given, so that the address can be used to send an email. Because the essential information is already in the text, it need not be repeated in a footnote or a bibliographic entry. The following options can be used for this purpose:
References are divided into internal and external ones. Internal references are HREF tags that point to a file that is included in the LaTeX output; external references are those that point elsewhere. Internal references can be mapped to phrases that direct the reader to the corresponding section. External references have to be given in full, either as a footnote at the bottom of the page or as a bibliographic entry. They are generated as bibliographic entries if the input file contains a line with '%html -b' (or if the program option -b is given); otherwise they are generated as footnotes. There are four generation modes:
These are the options for internal references:
The options for external references as footnotes are:
The options for citations are:
The following options deal with the formatting of all kinds of references. They make it possible to add formatting around the anchor text or the image tag. The "%R" indicates the place where the reference should be placed. This can be either an internal or an external reference, in the running text or as a footnote. If the "%R" appears in a fragile environment, it should be changed into "%fR". If it appears in a place where a \footnote would not be proper, a combination of "%mR" and "%tR" can be used to indicate the place of the footnote marker and the footnote text, respectively. (An "f" can be added if they occur in a fragile environment.)
Support for tables is still minimal, but the following two options are related to converting tables:
<!--latex-def table " \begin{tabular}{|p{3.5cm}|p{8cm}|}\hline " " \end{tabular} "-->
<!--latex-opt tab_row_sep " \\ "-->
<!--latex-opt tab_cell_sep " & "-->
<!--latex-def th " \textbf{" " } "-->
<TABLE>
<TR><TH>A</TH><TH>B</TH></TR>
<TR><TD>1</TD><TD>2</TD></TR>
</TABLE>
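To make the role of the two separators concrete, the following Python sketch joins the cells of the example table. It only guesses at the behaviour from the option names tab_row_sep and tab_cell_sep; the function name is hypothetical, and whether html2tex emits the row separator after the last row is an assumption.

```python
def render_rows(rows, cell_sep=" & ", row_sep=" \\\\ "):
    """Join table cells and rows with the configured separators."""
    # Assumption: the row separator follows every row, including the last.
    return "".join(cell_sep.join(cells) + row_sep for cells in rows)


body = render_rows([["A", "B"], ["1", "2"]])
print(body)  # A & B \\ 1 & 2 \\ 
```

The strings from the latex-def for the table tag would then be placed around this body.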
The program recognizes the following command line options:
Known bugs are:
#define ASCII8
As a spin-off of this program, I developed the program chkhtml.c, which I use as part of my tools for maintaining this web site.