Class FastURLFilter

  • All Implemented Interfaces:
    Configurable, URLFilter, Pluggable

    public class FastURLFilter
    extends Object
    implements URLFilter
    Filters URLs based on a file of regular expressions, matching on host and domain first. By default, a URL is accepted if no rule matches. Rule format:
     Host www.example.org
       DenyPath /path/to/be/excluded
       DenyPath /some/other/path/excluded
    
     # Deny everything from *.example.com and example.com
     Domain example.com
       DenyPath .*
    
     Domain example.org
       DenyPathQuery /resource/.*?action=exclude
     
    Host rules are evaluated before Domain rules. For Host rules, the entire host name of a URL must match, while a Domain rule matches if its domain name is a suffix of the host name consisting of complete host name parts. Longer domain suffixes are checked first. A single dot "." as the "domain name" specifies global rules that apply to every URL. For example, for "www.example.com" the rules given above are looked up in the following order:
    1. check whether host-based rules exist for "www.example.com" and whether one of them matches
    2. check "www.example.com" for domain-based rules
    3. check "example.com" for domain-based rules
    4. check "com" for domain-based rules
    5. check for global rules ("Domain .")
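    The lookup order listed above can be illustrated with a minimal sketch. LookupOrderSketch and lookupKeys are hypothetical names used only for illustration; they are not part of the class documented here, and the actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Nutch): lists the rule lookups for a host name. */
class LookupOrderSketch {

  /** Returns the lookups in the order described above, each prefixed by its rule type. */
  static List<String> lookupKeys(String host) {
    List<String> keys = new ArrayList<>();
    if (host != null && !host.isEmpty()) {
      // Host rules require the entire host name to match.
      keys.add("Host " + host);
      // Domain rules are first looked up for the full host name ...
      keys.add("Domain " + host);
      // ... then for each suffix made of complete host name parts.
      int start = 0;
      int dot;
      while ((dot = host.indexOf('.', start)) != -1) {
        start = dot + 1;
        if (start < host.length()) {
          keys.add("Domain " + host.substring(start));
        }
      }
    }
    // Global rules ("Domain .") are checked last and also apply to URLs without a host.
    keys.add("Domain .");
    return keys;
  }

  public static void main(String[] args) {
    lookupKeys("www.example.com").forEach(System.out::println);
  }
}
```

    For a URL without a host name, the sketch yields only the global lookup, in line with the description of host-less URLs.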
    The first matching rule rejects the URL, and no further rules are checked. If no rule matches, the URL is accepted. URLs without a host name (e.g., file:/path/file.txt) are checked against global rules only. URLs that cannot be parsed as a URL are always rejected.
    Each rule checks whether the given Java regular expression is found (see Matcher.find()) in the URL path (DenyPath) or in the path and query (DenyPathQuery). Rules are applied in the order of their definition. For better performance, regular expressions that are simpler/faster or that match more URLs should be defined earlier.
    Comments in the rule file start with the # character and extend to the end of the line. The rules file is set via the property urlfilter.fast.file; the default name is fast-urlfilter.txt.
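    The difference between DenyPath and DenyPathQuery, and the unanchored behavior of Matcher.find(), can be sketched as follows. DenyRuleSketch and its filter method are hypothetical and ignore the host/domain lookup entirely; the sketch assumes the path is taken from URL.getPath() and the path plus query from URL.getFile(), which may not be how the documented class obtains them.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

/** Hypothetical sketch (not part of Nutch): applies one DenyPath and one DenyPathQuery rule. */
class DenyRuleSketch {

  /** Returns the URL unchanged if accepted, or null if it is rejected. */
  static String filter(String url, Pattern denyPath, Pattern denyPathQuery) {
    URL u;
    try {
      u = new URL(url);
    } catch (MalformedURLException e) {
      // URLs that fail to parse are always rejected.
      return null;
    }
    // DenyPath: the pattern only needs to be found somewhere in the path.
    if (denyPath.matcher(u.getPath()).find()) {
      return null;
    }
    // DenyPathQuery: getFile() is the path plus "?query" if a query is present.
    if (denyPathQuery.matcher(u.getFile()).find()) {
      return null;
    }
    return url;
  }

  public static void main(String[] args) {
    Pattern path = Pattern.compile("/path/to/be/excluded");
    Pattern pathQuery = Pattern.compile("/resource/.*?action=exclude");
    System.out.println(filter("http://www.example.org/resource/item?action=exclude", path, pathQuery));
    System.out.println(filter("http://www.example.org/resource/item?action=include", path, pathQuery));
  }
}
```

    Because find() is not anchored, a rule such as DenyPath /path/to/be/excluded also rejects URLs whose path merely contains that string; an explicit ^ or $ in the expression would be needed to anchor it.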
    • Field Detail

      • LOG

        protected static final org.slf4j.Logger LOG
    • Constructor Detail

      • FastURLFilter

        public FastURLFilter()
    • Method Detail

      • filter

        public String filter(String url)
        Description copied from interface: URLFilter
        Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
        Specified by:
        filter in interface URLFilter
        Parameters:
        url - the URL string the filter is applied to
        Returns:
        the original URL string if the URL is accepted by the filter or null in case the URL is rejected
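        A caller has to treat a null return value as "rejected". The following minimal sketch shows one way to consume that contract. FilterContractSketch and keepAccepted are hypothetical names, and a plain UnaryOperator<String> stands in for the filter so that the example needs no Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical sketch (not part of Nutch): keeps only the URLs a filter accepts. */
class FilterContractSketch {

  /** Applies the filter to each URL and drops those for which it returns null. */
  static List<String> keepAccepted(List<String> urls, UnaryOperator<String> filter) {
    List<String> accepted = new ArrayList<>();
    for (String url : urls) {
      String result = filter.apply(url);
      if (result != null) {
        accepted.add(result);
      }
    }
    return accepted;
  }

  public static void main(String[] args) {
    // Stand-in filter for illustration: rejects every URL containing "/excluded".
    UnaryOperator<String> standIn = u -> u.contains("/excluded") ? null : u;
    System.out.println(keepAccepted(
        List.of("http://www.example.org/a", "http://www.example.org/excluded/b"), standIn));
  }
}
```

        With a configured filter instance, a method reference to its filter method could presumably be passed in place of the stand-in lambda.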