java.lang.Object
  edu.stanford.nlp.annotation.HtmlCleaner
HtmlCleaner removes various code elements (style, script, applet, and so on) from an HTML document.
HtmlCleaner is built on top of the HtmlParser package written by Quiotix, which compresses vertical white space and outputs an HTML document with consistent syntax (all HTML is lower case, spacing in tags is even, and attributes are unquoted except for file names and string literals). HtmlCleaner adds filters for comments, for meta and area tags, and for script, style, server, and applet tags along with the text contained between those tags, none of which generally adds to the "semantics" of a web page. In addition, the user may specify any additional tags to filter out.
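As a rough illustration of what the default filters described above remove, the following self-contained sketch strips the same categories of markup with regular expressions. It is not how HtmlCleaner works internally (HtmlCleaner relies on the Quiotix parser), and the class name HtmlFilterSketch and the method stripDefaults are hypothetical names chosen for this example only.

```java
import java.util.regex.Pattern;

/** Illustrative stand-in only; HtmlCleaner itself uses the Quiotix parser, not regexes. */
final class HtmlFilterSketch {

    // Comments: <!-- ... -->
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Elements removed together with the text between their start and end tags.
    private static final Pattern SPANNING = Pattern.compile(
            "<(script|style|server|applet)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Tags removed on their own, with no enclosed text involved.
    private static final Pattern STANDALONE = Pattern.compile(
            "</?(meta|area)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Removes comments first, then spanning elements, then standalone tags. */
    static String stripDefaults(String html) {
        String out = COMMENTS.matcher(html).replaceAll("");
        out = SPANNING.matcher(out).replaceAll("");
        return STANDALONE.matcher(out).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<html><head><meta name=a><style>p{}</style></head>"
                + "<body><!-- note --><p>Hello</p><script>x()</script></body></html>";
        // Prints: <html><head></head><body><p>Hello</p></body></html>
        System.out.println(stripDefaults(page));
    }
}
```

Regular expressions are fragile on malformed or nested markup, which is one reason to prefer a real parser for anything beyond a demonstration like this one.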
This is usually a necessary first step for processing HTML documents before passing them into TaggedStreamTokenizer, as the TST makes no attempt to fix spacing and similar problems (though it can filter tags and comments).
HtmlCleaner can be run in batch mode with default settings, using a shell script similar to the following. Note that the script as written only works for single-word HTML file names (no spaces); I am not a script expert and have not worked out a way around this.
#!/bin/ksh
#
DIR=docs
OUTDIR=docs/cleaned
# ------------
for FILE in $DIR/*.htm*
do
  echo $FILE
  FILEROOT=${FILE%.*}
  OUTFILE="$FILEROOT-c.html"
  java HtmlCleaner $FILE > $OUTFILE
done
mv $DIR/*-c.html $OUTDIR
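The naming step of the script (strip the extension, append -c.html, place the result in the output directory) can also be sketched in Java, where file names are never split on spaces. This is only a sketch of the naming logic: the class CleanedNameSketch and the method cleanedPath are hypothetical, and the actual cleaning call is left out. Unlike the script, it computes the final location directly instead of writing next to the input and moving the file afterwards.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper mirroring FILEROOT=${FILE%.*} and OUTFILE="$FILEROOT-c.html". */
final class CleanedNameSketch {

    /** Returns outDir/<name without its last extension>-c.html for the given input file. */
    static Path cleanedPath(Path input, Path outDir) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        // Keep the whole name when there is no extension to strip.
        String root = (dot >= 0) ? name.substring(0, dot) : name;
        return outDir.resolve(root + "-c.html");
    }

    public static void main(String[] args) {
        Path in = Paths.get("docs", "my page.htm");
        Path out = cleanedPath(in, Paths.get("docs", "cleaned"));
        // A real batch run would open 'in' and 'out' as streams here and hand
        // them to the cleaner; this sketch only prints the derived path.
        System.out.println(out);
    }
}
```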
Some of the code has been customized for Annotator. For example, unnecessary quotes are no longer stripped from attributes, since they are required by the Annotator. Also, ignoreAttributes does not hold for "tag" tags. The attributes for those tags are always displayed, again since they are required by the Annotator.
Field Summary
static String[] | defaultAcceptableTags
Constructor Summary
HtmlCleaner(InputStream in, OutputStream os) | Creates a new HtmlCleaner with default settings.
HtmlCleaner(Reader r, OutputStream os) | Creates a new HtmlCleaner with default settings.
Method Summary
void | addAcceptableTag(String tag) | Adds the given tag to the acceptable tag set.
void | addIgnore(String tagName, boolean spans) | Specifies a type of html tag to ignore.
void | clean() | Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.
static Set | getDefaultAcceptableTags() | Returns the default acceptable tags as a Set of Strings.
static void | main(String[] args) | Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.
void | removeAcceptableTag(String tag) | Removes the given tag from the acceptable tag set, if the set is not null.
void | setAcceptableTags(Set acceptableTags) | Sets the set of acceptable tags; null clears it.
void | setDefaultIgnores(boolean val) | The default html elements to ignore are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.
void | setIgnoreAttributes(boolean ignore) | Sets whether tag attributes will be omitted from output or not.
void | setIgnoreComments(boolean val) | Specifies whether comments should be ignored.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Field Detail
public static final String[] defaultAcceptableTags
Constructor Detail
public HtmlCleaner(InputStream in, OutputStream os)
Creates a new HtmlCleaner with default settings.
in - the input stream of the html to be cleaned
public HtmlCleaner(Reader r, OutputStream os)
Creates a new HtmlCleaner with default settings.
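Method Detail
public void addAcceptableTag(String tag)
Adds the given tag to the acceptable tag set.
public void removeAcceptableTag(String tag)
Removes the given tag from the acceptable tag set, if the set is not null.
public void setAcceptableTags(Set acceptableTags)
Sets the set of acceptable tags; null clears it.
public static Set getDefaultAcceptableTags()
Returns the default acceptable tags as a Set of Strings.
public void setIgnoreAttributes(boolean ignore)
Sets whether tag attributes will be omitted from output or not.
public void clean() throws IOException, com.quiotix.html.parser.ParseException
Cleans the html file specified in the constructor and writes the output to the output stream specified in the constructor.
Throws: IOException, com.quiotix.html.parser.ParseException
public void setDefaultIgnores(boolean val)
The default html elements to ignore are: comments; script, style, server, and applet tags and the text within those tags; meta and area tags.
val - true indicates the default html is ignored
public void setIgnoreComments(boolean val)
Specifies whether comments should be ignored. setDefaultIgnores overrides this setting.
val - true indicates comments are ignored
public void addIgnore(String tagName, boolean spans)
Specifies a type of html tag to ignore. [...] setDefaultIgnores, that tag will no longer be ignored. Only html-style tags are supported.
tagName - the name of the tag to be ignored, e.g. "table" or "font", without brackets or attributes
spans - true indicates the tag comes in start/end pairs, such as font or table. Pass in false for tags such as br.
public static void main(String[] args)
Runs HtmlCleaner with default settings on a specified file, printing the cleaned html to standard out.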