Class NaiveBayesParseFilter
- java.lang.Object
  - org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter
-
- All Implemented Interfaces:
Configurable, HtmlParseFilter, Pluggable
public class NaiveBayesParseFilter extends Object implements HtmlParseFilter
HTML parse filter that classifies the outlinks from the ParseResult as relevant or irrelevant, based on the relevancy of the page's parse text. Relevancy is determined using a training file in which you can give positive and negative example texts (see the description of parsefilter.naivebayes.trainfile). If the parse text is found irrelevant, a link still gets a second chance if it contains any of the words from the list given in parsefilter.naivebayes.wordlist.
CAUTION: When using this classifier, set parser.timeout to -1 or to a value greater than 30.
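The outlink handling described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the plugin's code: the class OutlinkSecondChanceSketch, the keepOutlinks helper, the Predicate that stands in for the trained classifier, and the use of plain substring matching against the URL are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class OutlinkSecondChanceSketch {

    // Hypothetical stand-in for containsWord(String, ArrayList<String>);
    // substring matching is an assumption, not confirmed by the Javadoc.
    public static boolean containsWord(String url, List<String> wordlist) {
        for (String word : wordlist) {
            if (url.contains(word)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: returns the outlinks that survive filtering.
    // isRelevant stands in for the naive Bayes decision on the parse text.
    public static List<String> keepOutlinks(String parseText,
                                            List<String> outlinks,
                                            Predicate<String> isRelevant,
                                            List<String> wordlist) {
        if (isRelevant.test(parseText)) {
            // Relevant page: all outlinks are kept.
            return new ArrayList<>(outlinks);
        }
        // Irrelevant page: each outlink gets a second chance via the word list.
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (containsWord(url, wordlist)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
                "http://example.org/nutch-crawler",
                "http://example.org/recipes");
        List<String> wordlist = Arrays.asList("nutch");
        // The classifier stub reports the page as irrelevant.
        List<String> kept = keepOutlinks("some page text", outlinks, t -> false, wordlist);
        System.out.println(kept); // prints [http://example.org/nutch-crawler]
    }
}
```

In a real crawl the decision and the word list would come from the filter's configuration (parsefilter.naivebayes.trainfile and parsefilter.naivebayes.wordlist) rather than from in-memory values as shown here.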
-
-
Field Summary
Fields
Modifier and Type    Field                    Description
static String        DICTFILE_MODELFILTER
static String        TRAINFILE_MODELFILTER
-
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
-
Constructor Summary
Constructors
Constructor                Description
NaiveBayesParseFilter()
-
Method Summary
Modifier and Type    Method                                                Description
boolean              classify(String text)
boolean              containsWord(String url, ArrayList<String> wordlist)
ParseResult          filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
                     Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
boolean              filterParse(String text)
boolean              filterUrl(String url)
Configuration        getConf()
void                 setConf(Configuration conf)
void                 train()
-
-
-
Field Detail
-
TRAINFILE_MODELFILTER
public static final String TRAINFILE_MODELFILTER
- See Also:
- Constant Field Values
-
DICTFILE_MODELFILTER
public static final String DICTFILE_MODELFILTER
- See Also:
- Constant Field Values
-
-
Method Detail
-
filterParse
public boolean filterParse(String text)
-
filterUrl
public boolean filterUrl(String url)
-
classify
public boolean classify(String text) throws IOException
- Throws:
IOException
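For orientation, the sketch below shows one common way a naive Bayes text classifier can arrive at a boolean relevancy decision of the kind classify(String text) returns. It is a multinomial naive Bayes with add-one smoothing over lowercase word tokens. The class name TinyNaiveBayesSketch, the in-memory train calls, and the tokenization are assumptions for illustration; nothing here is taken from the plugin's actual model or training-file format.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TinyNaiveBayesSketch {
    private final Map<String, Integer> posCounts = new HashMap<>();
    private final Map<String, Integer> negCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int posDocs, negDocs, posTokens, negTokens;

    // Assumed tokenization: lowercase, split on runs of non-word characters.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("\\W+");
    }

    // Adds one labeled example text (relevant = positive example).
    public void train(String text, boolean relevant) {
        if (relevant) posDocs++; else negDocs++;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            vocabulary.add(token);
            if (relevant) {
                posCounts.merge(token, 1, Integer::sum);
                posTokens++;
            } else {
                negCounts.merge(token, 1, Integer::sum);
                negTokens++;
            }
        }
    }

    // Returns true when the relevant class has the higher log-probability.
    // Assumes at least one example of each class has been trained.
    public boolean classify(String text) {
        double total = posDocs + negDocs;
        double pos = Math.log(posDocs / total);
        double neg = Math.log(negDocs / total);
        int v = vocabulary.size();
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue;
            // Add-one (Laplace) smoothing so unseen words do not zero out a class.
            pos += Math.log((posCounts.getOrDefault(token, 0) + 1.0) / (posTokens + v));
            neg += Math.log((negCounts.getOrDefault(token, 0) + 1.0) / (negTokens + v));
        }
        return pos > neg;
    }

    public static void main(String[] args) {
        TinyNaiveBayesSketch nb = new TinyNaiveBayesSketch();
        nb.train("apache nutch web crawler indexing", true);
        nb.train("chocolate cake recipe baking", false);
        System.out.println(nb.classify("open source web crawler")); // prints true
        System.out.println(nb.classify("cake recipe"));             // prints false
    }
}
```

Working in log space avoids numeric underflow on long texts, which is why the sketch sums logarithms instead of multiplying probabilities.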
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
- Specified by:
filter in interface HtmlParseFilter
- Parameters:
content - the Content for a given response
parseResult - the result of running one or more Parsers on the content
metaTags - a populated HTMLMetaTags object
doc - a DocumentFragment (DOM) which can be processed in the filtering process
- Returns:
a filtered ParseResult
- See Also:
Parser.getParse(Content)
-
-