Package org.apache.nutch.urlfilter.fast
Class FastURLFilter
java.lang.Object
  org.apache.nutch.urlfilter.fast.FastURLFilter
All Implemented Interfaces: Configurable, URLFilter, Pluggable
public class FastURLFilter extends Object implements URLFilter
Filters URLs based on a file of regular expressions, matching hosts and domains first. The default policy is to accept a URL if no matches are found.
Rule Format:
  Host www.example.org
    DenyPath /path/to/be/excluded
    DenyPath /some/other/path/excluded

  # Deny everything from *.example.com and example.com
  Domain example.com
    DenyPath .*

  Domain example.org
    DenyPathQuery /resource/.*?action=exclude
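As an illustration of how a file in this format could be read, the following is a minimal sketch and not the Nutch parser. The class name `RuleFileSketch`, the method `parse`, and the choice to silently skip malformed lines are assumptions made for the example. It strips `#` comments, then groups each `DenyPath` or `DenyPathQuery` line under the most recent `Host` or `Domain` line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the rule format shown above; not part of Nutch.
class RuleFileSketch {

  // Maps "Host <name>" or "Domain <name>" to the deny rules listed under it.
  static Map<String, List<String>> parse(String text) {
    Map<String, List<String>> rules = new LinkedHashMap<>();
    List<String> current = null;
    for (String line : text.split("\n")) {
      int hash = line.indexOf('#');
      if (hash != -1) {
        line = line.substring(0, hash); // a comment runs to the end of the line
      }
      line = line.trim();
      if (line.isEmpty()) {
        continue;
      }
      String[] parts = line.split("\\s+", 2);
      if (parts.length < 2) {
        continue; // assumption: lines without an argument are skipped
      }
      String keyword = parts[0];
      String argument = parts[1];
      if (keyword.equals("Host") || keyword.equals("Domain")) {
        // Start (or continue) the rule group for this host or domain.
        current = rules.computeIfAbsent(keyword + " " + argument,
            k -> new ArrayList<>());
      } else if ((keyword.equals("DenyPath") || keyword.equals("DenyPathQuery"))
          && current != null) {
        current.add(keyword + " " + argument);
      }
    }
    return rules;
  }

  public static void main(String[] args) {
    String sample = String.join("\n",
        "Host www.example.org",
        "  DenyPath /path/to/be/excluded",
        "  DenyPath /some/other/path/excluded",
        "",
        "# Deny everything from *.example.com and example.com",
        "Domain example.com",
        "  DenyPath .*");
    parse(sample).forEach((key, value) -> System.out.println(key + " -> " + value));
  }
}
```

Keeping the groups in a `LinkedHashMap` and the rules in lists preserves definition order, which matters because rules are applied in the order in which they are defined.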
Host rules are evaluated before Domain rules. For Host rules, the entire host name of a URL must match, whereas a Domain rule matches if its domain name is a suffix of the host name consisting of complete host name parts. Domain suffixes are checked one after another, from the full host name down to the top-level domain. A single dot "." used as the "domain name" specifies global rules, which are applied to every URL. For example, for "www.example.com" the rules given above are looked up in the following order:
- check "www.example.com" for host-based rules and whether one of them matches
- check "www.example.com" for domain-based rules
- check "example.com" for domain-based rules
- check "com" for domain-based rules
- check for global rules ("Domain .")
URLs without a host name (e.g., file:/path/file.txt) are checked for global rules only. URLs which fail to be parsed as URL are always rejected. A rule checks either the URL path (DenyPath) or the path and query (DenyPathQuery) for an occurrence of the given Java regular expression (see Matcher.find()). Rules are applied in the order of their definition. For better performance, regular expressions which are simpler/faster or match more URLs should be defined earlier. Comments in the rule file start with the # character and extend to the end of the line. The rules file is defined via the property urlfilter.fast.file; the default name is fast-urlfilter.txt.
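The behaviour described above can be sketched in plain Java without Nutch on the classpath. This is a simplified stand-in under stated assumptions, not the actual implementation: the class `MiniHostDomainFilter` and its methods `addHostRule` and `addDomainRule` are hypothetical names, and rules are added programmatically rather than read from a file. It shows the lookup order (host, then domain suffixes, then the global "." entry), the use of `Matcher.find()` on the path or on path plus query, and the rejection of unparsable URLs.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for the described behaviour; not Nutch code.
class MiniHostDomainFilter {

  // One deny rule: a regular expression searched in the path or in path plus query.
  static final class Deny {
    final Pattern pattern;
    final boolean withQuery;

    Deny(String regex, boolean withQuery) {
      this.pattern = Pattern.compile(regex);
      this.withQuery = withQuery;
    }

    boolean matches(URL url) {
      // URL.getFile() returns the path plus "?query" when a query is present.
      String target = withQuery ? url.getFile() : url.getPath();
      return pattern.matcher(target).find(); // find(), so a partial match suffices
    }
  }

  private final Map<String, List<Deny>> hostRules = new HashMap<>();
  private final Map<String, List<Deny>> domainRules = new HashMap<>();

  void addHostRule(String host, String regex, boolean withQuery) {
    hostRules.computeIfAbsent(host, k -> new ArrayList<>()).add(new Deny(regex, withQuery));
  }

  void addDomainRule(String domain, String regex, boolean withQuery) {
    domainRules.computeIfAbsent(domain, k -> new ArrayList<>()).add(new Deny(regex, withQuery));
  }

  // Returns the URL unchanged if it is accepted, or null if it is rejected.
  String filter(String urlString) {
    URL url;
    try {
      url = new URL(urlString); // deprecated in recent JDKs, but still available
    } catch (MalformedURLException e) {
      return null; // URLs that fail to parse are always rejected
    }
    String host = url.getHost();
    // Host rules first, then domain rules for the full host name.
    if (denied(hostRules, host, url) || denied(domainRules, host, url)) {
      return null;
    }
    // Then each remaining suffix made of complete host name parts.
    int start = 0;
    int pos;
    while ((pos = host.indexOf('.', start)) != -1) {
      start = pos + 1;
      if (denied(domainRules, host.substring(start), url)) {
        return null;
      }
    }
    // Finally the global rules registered under "Domain .".
    return denied(domainRules, ".", url) ? null : urlString;
  }

  private static boolean denied(Map<String, List<Deny>> rules, String key, URL url) {
    for (Deny rule : rules.getOrDefault(key, Collections.emptyList())) {
      if (rule.matches(url)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    MiniHostDomainFilter f = new MiniHostDomainFilter();
    f.addDomainRule("example.org", "/resource/.*?action=exclude", true);
    System.out.println(f.filter("https://www.example.org/resource/x?action=exclude"));
    System.out.println(f.filter("https://www.example.org/resource/x"));
  }
}
```

Note that in the sample rule `/resource/.*?action=exclude` the sequence `.*?` is a reluctant quantifier in Java regular expressions, not a literal question mark. The pattern still finds a match in `/resource/x?action=exclude` because `.` also matches the `?` character.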
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                              Description
static class         FastURLFilter.DenyAllRule          Rule for DenyPath .* or DenyPath .?
static class         FastURLFilter.DenyPathQueryRule
static class         FastURLFilter.DenyPathRule
static class         FastURLFilter.Rule
-
Field Summary
Fields
Modifier and Type                    Field                  Description
protected static org.slf4j.Logger    LOG
static String                        URLFILTER_FAST_FILE
-
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
-
Constructor Summary
Constructors
Constructor        Description
FastURLFilter()
-
Method Summary
All Methods
Modifier and Type    Method                         Description
String               filter(String url)             Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
Configuration        getConf()
void                 reloadRules()
void                 setConf(Configuration conf)
-
Field Detail
-
LOG
protected static final org.slf4j.Logger LOG
-
URLFILTER_FAST_FILE
public static final String URLFILTER_FAST_FILE
- See Also: Constant Field Values
-
-
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
-
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable
-
filter
public String filter(String url)
Description copied from interface: URLFilter
Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
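From the caller's side, the contract above means that a null result drops the URL and any other result is kept. The following minimal sketch illustrates this with a stand-in filter; `FilterContractSketch`, `applyFilter`, and the lambda used as a filter are hypothetical and only mimic the `String filter(String url)` signature, so that the example runs without Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical caller-side illustration of the "return null to delete" contract.
class FilterContractSketch {

  // Keeps every URL for which the filter returns a non-null value.
  static List<String> applyFilter(UnaryOperator<String> filter, List<String> urls) {
    List<String> accepted = new ArrayList<>();
    for (String url : urls) {
      String result = filter.apply(url);
      if (result != null) {
        accepted.add(result); // passed through by the filter
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    // Stand-in filter: rejects anything under /private/, passes the rest unchanged.
    UnaryOperator<String> filter = url -> url.contains("/private/") ? null : url;
    List<String> urls = Arrays.asList(
        "https://www.example.org/index.html",
        "https://www.example.org/private/report.html");
    System.out.println(applyFilter(filter, urls));
  }
}
```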
-
reloadRules
public void reloadRules() throws IOException
- Throws: IOException
-