Class DomainDenylistURLFilter

  • All Implemented Interfaces:
    Configurable, URLFilter, Pluggable

    public class DomainDenylistURLFilter
    extends Object
    implements URLFilter

    Filters URLs based on a file containing domain suffixes, domain names, and hostnames. A URL that matches one of the suffixes, domains, or hosts present in the file is filtered out.

    URLs are checked in order of domain suffix, domain name, and hostname against entries in the domain file. The domain file is set up as follows, with one entry per line:

     com
     apache.org
     www.apache.org
     

    The first line is an example of a filter that excludes all .com domains. The second line excludes all URLs from apache.org and all of its subdomains, such as lucene.apache.org and hadoop.apache.org. The third line excludes only URLs from www.apache.org. Entries may appear in the file in any order. Matching proceeds from the more general (suffix) to the more specific (host), and a more general entry overrides a more specific one.
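
The three-level check described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the class name `DenylistSketch` and its `filter` helper are hypothetical, and the sketch assumes the suffix is the last host label and the domain is the last two labels. A real implementation would need a public-suffix-aware lookup to handle suffixes such as co.uk. Rejecting unparsable URLs is also an assumption of this sketch.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

class DenylistSketch {

    /**
     * Returns the URL unchanged if it is accepted, or null if its suffix,
     * domain, or host appears in the denylist.
     */
    static String filter(String url, Set<String> denylist) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // assumption: URLs that cannot be parsed are rejected
        }
        if (host == null) {
            return null; // assumption: URLs without a host are rejected
        }
        host = host.toLowerCase(Locale.ROOT);

        // Simplification: suffix = last label, domain = last two labels.
        String[] labels = host.split("\\.");
        int n = labels.length;
        String suffix = labels[n - 1];
        String domain = n >= 2 ? labels[n - 2] + "." + suffix : host;

        // Check from the most general entry to the most specific one.
        if (denylist.contains(suffix)
                || denylist.contains(domain)
                || denylist.contains(host)) {
            return null;
        }
        return url;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for the domain file.
        Set<String> denylist = Set.of("apache.org");
        System.out.println(filter("http://lucene.apache.org/core/", denylist));
        System.out.println(filter("http://example.net/", denylist));
    }
}
```

With a denylist containing only www.apache.org, this sketch rejects http://www.apache.org/ but lets http://lucene.apache.org/ through, which mirrors the host-only behaviour of the third example line.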

    The domain file defaults to domaindenylist-urlfilter.txt in the classpath but can be overridden using:
    • the property "urlfilter.domaindenylist.file" in ./conf/nutch-*.xml, and
    • the attribute "file" in plugin.xml of this plugin

    • Constructor Detail

      • DomainDenylistURLFilter

        public DomainDenylistURLFilter()
    • Method Detail

      • filter

        public String filter(String url)
        Description copied from interface: URLFilter
        Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null.
        Specified by:
        filter in interface URLFilter
        Parameters:
        url - the URL string the filter is applied on
        Returns:
        the original URL string if the URL is accepted by the filter or null in case the URL is rejected
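
A caller has to treat a null return value as a rejection. The sketch below shows one way to apply that contract to a batch of URLs. It is illustrative only: `FilterContractSketch` and `keepAccepted` are hypothetical names, and a plain `UnaryOperator<String>` stands in for the `filter` method so that the example compiles without Nutch on the classpath.

```java
import java.util.List;
import java.util.Objects;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class FilterContractSketch {

    /**
     * Applies the filter to every URL and keeps only the accepted ones,
     * i.e. those for which the filter did not return null.
     */
    static List<String> keepAccepted(List<String> urls, UnaryOperator<String> filter) {
        return urls.stream()
                .map(filter)              // null marks a rejected URL
                .filter(Objects::nonNull) // drop the rejected URLs
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stand-in filter: rejects any URL whose text contains ".com".
        UnaryOperator<String> denyCom = u -> u.contains(".com") ? null : u;
        List<String> urls = List.of("http://a.com/", "http://b.org/");
        System.out.println(keepAccepted(urls, denyCom));
    }
}
```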