Class ExemptionUrlFilter

  • All Implemented Interfaces:
    Configurable, URLExemptionFilter, URLFilter, Pluggable

    public class ExemptionUrlFilter
    extends RegexURLFilter
    implements URLExemptionFilter
    This implementation of URLExemptionFilter uses regex configuration to check whether a URL is eligible for exemption from 'db.ignore.external'. When this filter is enabled, external URLs are checked against the configured sequence of regex rules.

    The exemption rule file defaults to db-ignore-external-exemptions.txt on the classpath, but can be overridden with the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml.

    The exemption rules are specified in a plain text file where each line is a rule. The format is the same as `regex-urlfilter.txt`. Each non-comment, non-blank line contains a regular expression prefixed by '+' or '-'. The first matching pattern in the file determines whether a URL is exempted ('+') or ignored ('-'). If no pattern matches, the URL is ignored.
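
    The first-match rule evaluation described above can be sketched in plain Java. This is a minimal illustration, not the Nutch implementation: the class name ExemptionRulesSketch and its methods are hypothetical, and it assumes that comment lines start with '#' and that a pattern matches when it is found anywhere in the URL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of first-match-wins rule evaluation; not part of Nutch. */
final class ExemptionRulesSketch {

  /** One rule: '+' means exempt, '-' means ignore, plus the compiled regex. */
  private static final class Rule {
    final boolean exempt;
    final Pattern pattern;

    Rule(boolean exempt, Pattern pattern) {
      this.exempt = exempt;
      this.pattern = pattern;
    }
  }

  private final List<Rule> rules = new ArrayList<>();

  /** Parses rule text: one rule per line; blank lines and '#' comments (assumed) are skipped. */
  static ExemptionRulesSketch parse(String text) {
    ExemptionRulesSketch result = new ExemptionRulesSketch();
    for (String raw : text.split("\\R")) {
      String line = raw.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      char sign = line.charAt(0);
      if (sign != '+' && sign != '-') {
        throw new IllegalArgumentException("Rule must start with '+' or '-': " + line);
      }
      result.rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
    }
    return result;
  }

  /** Returns the verdict of the first matching rule; false (ignored) if none matches. */
  boolean isExempted(String url) {
    for (Rule rule : rules) {
      if (rule.pattern.matcher(url).find()) {
        return rule.exempt;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String ruleText =
        "# never exempt anything under /private/\n"
      + "-/private/\n"
      + "\n"
      + "# exempt common image files\n"
      + "+\\.(jpg|png|gif)$\n";
    ExemptionRulesSketch sketch = parse(ruleText);
    System.out.println(sketch.isExempted("http://example.org/img/logo.png"));
    System.out.println(sketch.isExempted("http://example.org/private/logo.png"));
    System.out.println(sketch.isExempted("http://example.org/index.html"));
  }
}
```

    Because the first matching rule decides, rule order matters: in the sample rules a '-' rule placed before a broader '+' rule carves out an exception, so an image under /private/ is still ignored.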
    Since:
    Feb 10, 2016
    Version:
    1
    See Also:
    URLExemptionFilter, RegexURLFilter
    • Constructor Detail

      • ExemptionUrlFilter

        public ExemptionUrlFilter()
    • Method Detail

      • filter

        public boolean filter(String fromUrl,
                              String toUrl)
        Description copied from interface: URLExemptionFilter
        Checks whether toUrl is exempted when ignoring of external links ('db.ignore.external') is enabled.
        Specified by:
        filter in interface URLExemptionFilter
        Parameters:
        fromUrl - the source URL that generated the outlink
        toUrl - the destination URL that needs to be checked for exemption
        Returns:
        true when toUrl is exempted from 'db.ignore.external'
      • main

        public static void main(String[] args)