Package org.apache.nutch.parse.headings
Class HeadingsParseFilter
- java.lang.Object
  - org.apache.nutch.parse.headings.HeadingsParseFilter
- All Implemented Interfaces: Configurable, HtmlParseFilter, Pluggable
public class HeadingsParseFilter extends Object implements HtmlParseFilter
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
-
-
Field Summary
Fields
Modifier and Type | Field | Description
protected static Pattern | whitespacePattern | Pattern used to strip surplus whitespace
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor | Description
HeadingsParseFilter() |
-
Method Summary
Modifier and Type | Method | Description
ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
Configuration | getConf() |
protected List<String> | getElement(DocumentFragment doc, String element) | Finds the specified element and returns its value.
protected static String | getNodeValue(Node node) | Returns the text value of the specified Node and its child nodes.
void | setConf(Configuration conf) |
-
-
-
Field Detail
-
whitespacePattern
protected static Pattern whitespacePattern
Pattern used to strip surplus whitespace
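The page does not show the regular expression behind this field. As a rough illustration only, the sketch below assumes a pattern matching runs of whitespace (`\s+`) and collapses each run to a single space. The class name `WhitespaceSketch` and the method `normalize` are hypothetical and are not part of Nutch.

```java
import java.util.regex.Pattern;

class WhitespaceSketch {
    // Assumption: the pattern matches one or more whitespace characters.
    // The actual value of HeadingsParseFilter.whitespacePattern is not documented here.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replace every whitespace run with a single space, then trim both ends.
    static String normalize(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("  Apache \n\t Nutch   headings "));
    }
}
```

Compiling the pattern once in a static field avoids recompiling the expression for every heading that is processed.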
-
-
Method Detail
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by: filter in interface HtmlParseFilter
- Parameters:
  content - the Content for a given response
  parseResult - the result of running one or more Parsers on the content
  metaTags - a populated HTMLMetaTags object
  doc - a DocumentFragment (DOM) which can be processed in the filtering process
- Returns: a filtered ParseResult
- See Also: Parser.getParse(Content)
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable
-
getElement
protected List<String> getElement(DocumentFragment doc, String element)
Finds the specified element and returns its value.
- Parameters:
  doc - the input DocumentFragment to process
  element - the element to find in the DocumentFragment
- Returns: a List containing headings
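To illustrate the kind of lookup described here, the following is a minimal sketch that uses only the JDK's DOM classes, not the Nutch implementation. It walks a DocumentFragment depth-first, collects the text of every element whose name matches, and collapses whitespace. The class `HeadingsSketch`, the helper `parse`, the case-insensitive name match, and the `\s+` pattern are all assumptions made for the example; the real methods may behave differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HeadingsSketch {
    // Assumed whitespace pattern; the real field's regex is not shown on this page.
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Collect the normalized text of every element named `element` under `root`.
    static List<String> getElement(Node root, String element) {
        List<String> values = new ArrayList<>();
        collect(root, element, values);
        return values;
    }

    private static void collect(Node node, String element, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && element.equalsIgnoreCase(node.getNodeName())) {
            String text = getNodeValue(node);
            if (!text.isEmpty()) {
                out.add(text);
            }
            return; // a matched element is not searched for nested matches
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), element, out);
        }
    }

    // Concatenate the text of the node and all descendants, then collapse whitespace.
    static String getNodeValue(Node node) {
        StringBuilder sb = new StringBuilder();
        appendText(node, sb);
        return WHITESPACE.matcher(sb).replaceAll(" ").trim();
    }

    private static void appendText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            appendText(children.item(i), sb);
        }
    }

    // Hypothetical helper: build a DocumentFragment from well-formed XHTML text.
    static DocumentFragment parse(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment fragment = doc.createDocumentFragment();
            fragment.appendChild(doc.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample markup", e);
        }
    }

    public static void main(String[] args) {
        DocumentFragment doc = parse(
                "<html><body><h1>  Main \n <b>Title</b></h1><h2>Sub</h2><h1>Second</h1></body></html>");
        System.out.println(getElement(doc, "h1"));
        System.out.println(getElement(doc, "h2"));
    }
}
```

In this sketch, text inside nested inline markup such as `<b>` is included in the heading value, since the text helper descends into child nodes.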
-
-