Package org.apache.nutch.any23
Class Any23ParseFilter
- java.lang.Object
  - org.apache.nutch.any23.Any23ParseFilter
- All Implemented Interfaces: Configurable, HtmlParseFilter, Pluggable
public class Any23ParseFilter extends Object implements HtmlParseFilter
This implementation of HtmlParseFilter uses the Apache Any23 library to parse and extract structured data in RDF format from a variety of Web documents. The supported formats can be found at Apache Any23. In this implementation, triples are written as Notation3, and individual triples are identified within the output triple stream by the presence of '\n'. The presence of '\n' is a characteristic specific to N3 serialization in Any23. Using other writers that implement the TripleHandler interface would most likely require identifying an alternative data characteristic on which to split triple streams.
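The newline-based splitting described above can be illustrated with a minimal sketch. It is not the filter's actual code: the class name `TripleSplitSketch` and the helper `splitTriples` are hypothetical, and the sketch assumes only what the description states, namely that each serialized triple ends with '\n'.

```java
import java.util.ArrayList;
import java.util.List;

public class TripleSplitSketch {

    // Hypothetical helper: returns one string per triple, assuming every
    // triple in the serialized output is terminated by '\n'.
    static List<String> splitTriples(String serialized) {
        List<String> triples = new ArrayList<>();
        for (String line : serialized.split("\n")) {
            String trimmed = line.trim();
            // Skip blank lines so empty segments are not counted as triples.
            if (!trimmed.isEmpty()) {
                triples.add(trimmed);
            }
        }
        return triples;
    }

    public static void main(String[] args) {
        // Illustrative input only; the IRIs are made up for this example.
        String output =
            "<http://example.org/a> <http://example.org/p> \"x\" .\n"
          + "<http://example.org/b> <http://example.org/p> \"y\" .\n";
        for (String triple : splitTriples(output)) {
            System.out.println(triple);
        }
    }
}
```

A writer whose serialization spreads one statement over several lines would break this assumption, which is why a different TripleHandler would likely need a different splitting characteristic.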
-
-
Field Summary
Fields
Modifier and Type    Field                        Description
static String        ANY_23_CONTENT_TYPES_CONF
static String        ANY_23_EXTRACTORS_CONF
static String        ANY23_TRIPLES                Constant identifier used as a key for writing and reading triples to and from the metadata Map field.
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor           Description
Any23ParseFilter()
-
Method Summary
Modifier and Type    Method    Description
ParseResult          filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)    Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
Configuration        getConf()
void                 setConf(Configuration conf)
-
-
-
Field Detail
-
ANY23_TRIPLES
public static final String ANY23_TRIPLES
Constant identifier used as a key for writing and reading triples to and from the metadata Map field.
- See Also:
- Constant Field Values
-
ANY_23_EXTRACTORS_CONF
public static final String ANY_23_EXTRACTORS_CONF
- See Also:
- Constant Field Values
-
ANY_23_CONTENT_TYPES_CONF
public static final String ANY_23_CONTENT_TYPES_CONF
- See Also:
- Constant Field Values
-
-
Method Detail
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by:
filter in interface HtmlParseFilter
- Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed in the filtering process
- Returns:
- a filtered ParseResult
- See Also:
HtmlParseFilter.filter(Content, ParseResult, HTMLMetaTags, DocumentFragment)
-
-