Package org.apache.nutch.urlfilter.api
Class RegexURLFilterBase
- java.lang.Object
-
- org.apache.nutch.urlfilter.api.RegexURLFilterBase
-
- All Implemented Interfaces:
Configurable, URLFilter, Pluggable
- Direct Known Subclasses:
AutomatonURLFilter, RegexURLFilter
public abstract class RegexURLFilterBase extends Object implements URLFilter
Generic URLFilter based on regular expressions.
The regular-expression rules are expressed in a file. The file of rules is determined for each implementation using the getRulesReader(Configuration conf) method.
The file consists of rules, one per line, in the format:
[+-]<regex>
where a plus (+) means the URL is accepted (go ahead and index it) and a minus (-) means it is rejected.
- Author:
- Jérôme Charron
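As a minimal sketch of the rule format described above, the following stand-alone Java class parses rule text in which each line is [+-]<regex>. It does not use the Nutch classes: RuleFileSketch, Rule, and parse are hypothetical names, and skipping blank lines is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the Nutch implementation.
final class RuleFileSketch {

  /** One parsed rule: the sign and the compiled regular expression. */
  static final class Rule {
    final boolean accept;   // true for '+', false for '-'
    final Pattern pattern;

    Rule(boolean accept, Pattern pattern) {
      this.accept = accept;
      this.pattern = pattern;
    }
  }

  /** Parses rule text, one "[+-]<regex>" rule per line. */
  static List<Rule> parse(String rules) {
    List<Rule> parsed = new ArrayList<>();
    for (String line : rules.split("\\r?\\n")) {
      if (line.isEmpty()) {
        continue; // assumption: blank lines are ignored
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Invalid rule: " + line);
      }
      parsed.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
    return parsed;
  }

  public static void main(String[] args) {
    List<Rule> rules = parse("-\\.(gif|jpg|png)$\n+^https?://");
    for (Rule r : rules) {
      System.out.println((r.accept ? "accept " : "reject ") + r.pattern);
    }
  }
}
```

A malformed regular expression surfaces as an IllegalArgumentException here as well, since PatternSyntaxException is a subclass of it.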
-
-
Field Summary
Fields
- protected boolean hasHostDomainRules
  Whether there are host- or domain-specific rules.
Fields inherited from interface org.apache.nutch.net.URLFilter
- X_POINT_ID
-
-
Constructor Summary
Constructors
- RegexURLFilterBase()
  Constructs a new empty RegexURLFilterBase.
- RegexURLFilterBase(File filename)
  Constructs a new RegexURLFilter and initializes it with a file of rules.
- protected RegexURLFilterBase(Reader reader)
  Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- RegexURLFilterBase(String rules)
  Constructs a new RegexURLFilter and initializes it with a list of rules.
-
Method Summary
Methods
- protected abstract RegexRule createRule(boolean sign, String regex)
  Creates a new RegexRule.
- protected abstract RegexRule createRule(boolean sign, String regex, String hostOrDomain)
  Creates a new RegexRule.
- String filter(String url)
  Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
- Configuration getConf()
- protected abstract Reader getRulesReader(Configuration conf)
  Returns a Reader over the file of rules to use for a particular implementation.
- static void main(RegexURLFilterBase filter, String[] args)
  Filters the standard input using a RegexURLFilterBase.
- void setConf(Configuration conf)
-
-
-
Field Detail
-
hasHostDomainRules
protected boolean hasHostDomainRules
Whether there are host- or domain-specific rules. If there are no such rules, the host and domain name are not extracted from the URL, which speeds up matching. readRules(Reader) automatically sets this to true if host- or domain-specific rules are used in the rule file.
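The effect of this flag can be illustrated with a small stand-alone sketch. HostRuleSketch, hostIfNeeded, and the hostExtractions counter are hypothetical names, and using java.net.URI for host extraction is an assumption; the actual class may extract the host and domain differently.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of skipping host extraction when no rule needs it.
final class HostRuleSketch {
  // Mirrors the documented flag: true only if some rule is host- or domain-specific.
  boolean hasHostDomainRules = false;

  // Counts how often the URL was actually parsed (for illustration only).
  int hostExtractions = 0;

  /** Returns the host of the URL, or null when it is not needed or not parseable. */
  String hostIfNeeded(String url) {
    if (!hasHostDomainRules) {
      return null; // fast path: no URL parsing at all
    }
    hostExtractions++;
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null; // assumption: unparseable URLs simply have no host
    }
  }

  public static void main(String[] args) {
    HostRuleSketch sketch = new HostRuleSketch();
    System.out.println(sketch.hostIfNeeded("https://example.org/a")); // prints null
    sketch.hasHostDomainRules = true;
    System.out.println(sketch.hostIfNeeded("https://example.org/a")); // prints example.org
  }
}
```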
-
-
Constructor Detail
-
RegexURLFilterBase
public RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase.
-
RegexURLFilterBase
public RegexURLFilterBase(File filename) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a file of rules.
- Parameters:
filename - the name of the rules file.
- Throws:
IOException - if there is a fatal I/O error interpreting the input File
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
RegexURLFilterBase
public RegexURLFilterBase(String rules) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a list of rules.
- Parameters:
rules - string with a list of rules, one rule per line
- Throws:
IOException - if there is a fatal I/O error interpreting the input rules
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
RegexURLFilterBase
protected RegexURLFilterBase(Reader reader) throws IOException, IllegalArgumentException
Constructs a new RegexURLFilter and initializes it with a Reader of rules.
- Parameters:
reader - a reader of rules.
- Throws:
IOException - if there is a fatal I/O error interpreting the input Reader
IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
-
-
Method Detail
-
createRule
protected abstract RegexRule createRule(boolean sign, String regex)
Creates a new RegexRule.
- Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
- Returns:
RegexRule
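The sign and regex parameters fit a factory-method design, in which the base class reads a rule line and the subclass decides which regex engine backs the rule. The sketch below mirrors that idea with hypothetical stand-ins (SketchRule, SketchFilterBase, JavaRegexFilter, ruleFor). It is not the Nutch code, and the real RegexRule class may expose different methods.

```java
import java.util.regex.Pattern;

// Hypothetical stand-ins that mirror the factory-method shape; not the Nutch classes.
final class RuleFactorySketch {

  /** Stand-in for a rule: a sign plus a way to test a URL. */
  abstract static class SketchRule {
    private final boolean sign;

    SketchRule(boolean sign) {
      this.sign = sign;
    }

    /** True if matching URLs must be included, false if they must be excluded. */
    boolean accept() {
      return sign;
    }

    abstract boolean match(String url);
  }

  /** Stand-in for the abstract base: it splits a line, subclasses build the rule. */
  abstract static class SketchFilterBase {
    protected abstract SketchRule createRule(boolean sign, String regex);

    /** Turns one well-formed "[+-]<regex>" line into a rule via the factory method. */
    SketchRule ruleFor(String line) {
      return createRule(line.charAt(0) == '+', line.substring(1));
    }
  }

  /** One possible subclass, backed by java.util.regex. */
  static final class JavaRegexFilter extends SketchFilterBase {
    @Override
    protected SketchRule createRule(boolean sign, String regex) {
      final Pattern pattern = Pattern.compile(regex);
      return new SketchRule(sign) {
        @Override
        boolean match(String url) {
          return pattern.matcher(url).find();
        }
      };
    }
  }

  public static void main(String[] args) {
    SketchRule rule = new JavaRegexFilter().ruleFor("-\\.(gif|jpg)$");
    System.out.println(rule.accept() + " " + rule.match("http://example.org/a.gif"));
  }
}
```

Keeping the engine choice behind createRule is what would let two subclasses share the same rule-file handling while compiling the expressions differently.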
-
createRule
protected abstract RegexRule createRule(boolean sign, String regex, String hostOrDomain)
Creates a new RegexRule.
- Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
hostOrDomain - the host or domain to which this regex belongs
- Returns:
RegexRule
-
getRulesReader
protected abstract Reader getRulesReader(Configuration conf) throws IOException
Returns a Reader over the file of rules to use for a particular implementation.
- Parameters:
conf - the current configuration.
- Returns:
a Reader for the resource containing the rules to use.
- Throws:
IOException - if there is a fatal error obtaining the Reader
-
filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
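A rough illustration of this contract follows: a URL is either returned unchanged or replaced by null. The rule ordering (the first matching rule decides) and the handling of URLs that match no rule (dropped) are assumptions made for the example, and FilterContractSketch with its add method is hypothetical rather than part of the documented API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the filter contract; not the Nutch code.
final class FilterContractSketch {

  private static final class Rule {
    final boolean sign;     // true = include, false = exclude
    final Pattern pattern;

    Rule(boolean sign, String regex) {
      this.sign = sign;
      this.pattern = Pattern.compile(regex);
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  FilterContractSketch add(boolean sign, String regex) {
    rules.add(new Rule(sign, regex));
    return this;
  }

  /** Returns the URL unchanged to keep it, or null to "delete" it. */
  String filter(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.sign ? url : null; // assumption: first matching rule decides
      }
    }
    return null; // assumption: a URL matching no rule is dropped
  }

  public static void main(String[] args) {
    FilterContractSketch sketch = new FilterContractSketch()
        .add(false, "\\.(gif|jpg|png)$")
        .add(true, "^https?://");
    System.out.println(sketch.filter("http://example.org/index.html")); // prints the URL
    System.out.println(sketch.filter("http://example.org/logo.png"));   // prints null
  }
}
```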
-
setConf
public void setConf(Configuration conf)
- Specified by:
setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by:
getConf in interface Configurable
-
main
public static void main(RegexURLFilterBase filter, String[] args) throws IOException, IllegalArgumentException
Filters the standard input using a RegexURLFilterBase.
- Parameters:
filter - the RegexURLFilterBase to use for filtering the standard input.
args - some optional parameters (not used).
- Throws:
IOException - if there is a fatal I/O error interpreting the input arguments
IllegalArgumentException - if there is a fatal error processing the input arguments
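Filtering the standard input can be pictured as a simple read loop, sketched below. This is a stand-alone illustration, not the Nutch method: StdinFilterSketch and run are hypothetical names, the filter is modeled as a plain Function that returns null for rejected URLs, and the output format (a + or - prefix before each URL) is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

// Hypothetical sketch of filtering URLs line by line; not the Nutch code.
final class StdinFilterSketch {

  /** Reads one URL per line and reports whether the filter kept it. */
  static void run(Function<String, String> filter, BufferedReader in, PrintStream out)
      throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      String kept = filter.apply(line);
      // Assumed output format: "+" for kept URLs, "-" for deleted ones.
      out.println(kept != null ? "+" + kept : "-" + line);
    }
  }

  public static void main(String[] args) throws IOException {
    // Toy filter standing in for a configured URL filter.
    Function<String, String> onlyHttp = url -> url.startsWith("http") ? url : null;
    BufferedReader stdin =
        new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8));
    run(onlyHttp, stdin, System.out);
  }
}
```

Passing the reader and the output stream as parameters keeps the loop testable without touching the real standard input.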
-
-