Package org.apache.nutch.parse.js
Class JSParseFilter
java.lang.Object
    org.apache.nutch.parse.js.JSParseFilter
- All Implemented Interfaces: Configurable, HtmlParseFilter, Parser, Pluggable
public class JSParseFilter extends Object implements HtmlParseFilter, Parser
This class is a heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java.
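The two-pass idea mentioned above can be illustrated with a minimal, self-contained sketch. It is not the code used by JSParseFilter: the class name TwoPassLinkSketch, both regular expressions, and the list of file extensions are assumptions chosen for illustration only. The first pass collects quoted string literals; the second pass keeps only the literals that look like URLs or paths.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of two-pass regex link extraction.
// Not the actual JSParseFilter implementation.
public class TwoPassLinkSketch {

    // Pass 1: contents of double- or single-quoted string literals.
    // Simplification: literals containing backslash escapes or newlines
    // are not handled correctly.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("\"([^\"\\\\\\n]*)\"|'([^'\\\\\\n]*)'");

    // Pass 2: keep literals that look like an absolute http(s) URL,
    // a path starting with "/", "./" or "../", or a relative path
    // ending in one of a few (assumed) file extensions.
    private static final Pattern URL_LIKE = Pattern.compile(
        "https?://\\S+"
        + "|\\.{0,2}/\\S+"
        + "|[\\w-]+(?:/[\\w.-]+)*\\.(?:html?|php|jsp|js)");

    public static List<String> extract(String js) {
        List<String> links = new ArrayList<>();
        Matcher literals = STRING_LITERAL.matcher(js);
        while (literals.find()) {
            // Exactly one of the two groups is non-null for each match.
            String candidate = literals.group(1) != null
                ? literals.group(1)
                : literals.group(2);
            if (URL_LIKE.matcher(candidate).matches()) {
                links.add(candidate);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/page\"; "
            + "var b = 'hello world'; "
            + "location.href = '/about.html'; "
            + "var c = \"index.html\";";
        // prints [http://example.com/page, /about.html, index.html]
        System.out.println(extract(js));
    }
}
```

Because the approach is heuristic, it can both miss links (for example URLs assembled by string concatenation) and report strings that are not links at all.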
Field Summary
-
Fields inherited from interface org.apache.nutch.parse.HtmlParseFilter
X_POINT_ID
-
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
-
-
Constructor Summary
Constructor | Description
JSParseFilter() |
-
Method Summary
Modifier and Type | Method | Description
ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) | Scan the JavaScript fragments of an HTML page looking for possible Outlinks.
Configuration | getConf() |
ParseResult | getParse(Content c) | Parse a JavaScript file and extract outlinks.
static void | main(String[] args) | Main method which can be run from the command line with the plugin option.
void | setConf(Configuration conf) |
Method Detail
-
filter
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
Scan the JavaScript fragments of an HTML page looking for possible Outlinks.
- Specified by:
filter in interface HtmlParseFilter
- Parameters:
content - page content
parseResult - parsed content, result of running the HTML parser
metaTags - meta tags of the page, within the HTMLMetaTags
doc - the DocumentFragment object
- Returns:
the ParseResult object with additional outlinks from JavaScript
- See Also:
Parser.getParse(Content)
-
getParse
public ParseResult getParse(Content c)
Parse a JavaScript file and extract outlinks
-
main
public static void main(String[] args) throws Exception
Main method which can be run from the command line with the plugin option. The method takes two arguments, e.g. o.a.n.parse.js.JSParseFilter file.js baseURL
- Parameters:
args - run with no args to get help
- Throws:
Exception - if there is a fatal error running the class with the given input
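The page does not say how the baseURL argument is used. A plausible reading, stated here as an assumption, is that relative links found in file.js are resolved against it. The sketch below shows that step with the standard java.net.URI class only; the class name BaseUrlResolveSketch and the method resolveAll are hypothetical and are not part of Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: resolving extracted link candidates
// against a base URL. Not taken from JSParseFilter.
public class BaseUrlResolveSketch {

    public static List<String> resolveAll(String baseUrl, List<String> candidates) {
        URI base = URI.create(baseUrl);
        List<String> resolved = new ArrayList<>();
        for (String candidate : candidates) {
            try {
                // Absolute candidates are returned unchanged by resolve();
                // relative ones are interpreted relative to the base.
                resolved.add(base.resolve(new URI(candidate)).toString());
            } catch (URISyntaxException e) {
                // Heuristic extraction can yield strings that are not valid
                // URIs (for example, text containing spaces); skip them.
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> out = resolveAll(
            "http://example.com/scripts/file.js",
            List.of("/about.html", "img/logo.png", "http://other.org/x", "not a url"));
        for (String url : out) {
            System.out.println(url);
        }
    }
}
```

With the inputs in main, the invalid candidate "not a url" is dropped and the remaining three are printed as absolute URLs.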
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable