Class RegexURLFilterBase

    • Field Detail

      • hasHostDomainRules

        protected boolean hasHostDomainRules
        Whether there are host- or domain-specific rules. If there are none, the host and domain name are not extracted from the URL, which speeds up matching. readRules(Reader) automatically sets this to true if the rule file contains host- or domain-specific rules.
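        The shortcut described above can be illustrated with a minimal, self-contained sketch. It is not the class's actual code: HostExtractionSketch, hostFor and the hostExtractions counter are hypothetical names introduced here, and java.net.URI stands in for whatever host extraction the real filter performs.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical stand-in for a filter that owns a hasHostDomainRules flag.
class HostExtractionSketch {
    // Mirrors the documented field; a real filter would set it while reading rules.
    boolean hasHostDomainRules = false;

    // Illustration only: counts how often a URL was actually parsed for its host.
    int hostExtractions = 0;

    // Returns the host of the URL, or null when no rule needs it
    // (or when the URL cannot be parsed).
    String hostFor(String url) {
        if (!hasHostDomainRules) {
            return null; // no host- or domain-specific rules: skip the parsing cost
        }
        hostExtractions++;
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            return null;
        }
    }
}
```

        With the flag left at false, hostFor never parses the URL; once the flag is true, every call pays for one parse.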
    • Constructor Detail

      • RegexURLFilterBase

        public RegexURLFilterBase()
        Constructs a new, empty RegexURLFilterBase.
      • RegexURLFilterBase

        public RegexURLFilterBase(String rules)
                           throws IOException,
                                  IllegalArgumentException
        Constructs a new RegexURLFilterBase and initializes it with a list of rules.
        Parameters:
        rules - string with a list of rules, one rule per line
        Throws:
        IOException - if there is a fatal I/O error interpreting the input rules
        IllegalArgumentException - if there is a fatal error processing the regex rules within the URLFilter
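        As a rough illustration of "one rule per line", the sketch below parses a rules string into sign/pattern pairs. The line syntax is an assumption that this page does not spell out: a leading '+' marks an include rule, a leading '-' an exclude rule, and '#' starts a comment. RuleStringSketch, Rule and parse are hypothetical names, not part of the documented API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RuleStringSketch {
    // Hypothetical rule holder: a sign plus a compiled pattern.
    static final class Rule {
        final boolean sign;
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            // Pattern.compile throws PatternSyntaxException, which is an
            // IllegalArgumentException, on a malformed regex.
            this.pattern = Pattern.compile(regex);
        }
    }

    // Parses one rule per line; the '+', '-' and '#' prefixes are assumed syntax.
    static List<Rule> parse(String rules) throws IOException {
        List<Rule> parsed = new ArrayList<>();
        BufferedReader in = new BufferedReader(new StringReader(rules));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue; // skip blank lines and comments
            }
            char first = line.charAt(0);
            if (first != '+' && first != '-') {
                throw new IOException("Invalid first character: " + line);
            }
            parsed.add(new Rule(first == '+', line.substring(1)));
        }
        return parsed;
    }
}
```

        In this sketch a malformed line surfaces as an IOException and a malformed regex as an IllegalArgumentException, matching the two exception types the constructor declares.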
    • Method Detail

      • createRule

        protected abstract RegexRule createRule(boolean sign,
                                                String regex)
        Creates a new RegexRule.
        Parameters:
        sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
        regex - the regular expression associated with this rule.
        Returns:
        RegexRule
      • createRule

        protected abstract RegexRule createRule(boolean sign,
                                                String regex,
                                                String hostOrDomain)
        Creates a new RegexRule.
        Parameters:
        sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
        regex - the regular expression associated with this rule.
        hostOrDomain - the host or domain to which this regex belongs
        Returns:
        RegexRule
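        A subclass has to supply both createRule overloads. The sketch below shows one possible shape using java.util.regex; SimpleRule and CreateRuleSketch are hypothetical stand-ins (the real RegexRule type is not reproduced here), and delegating the two-argument overload to the three-argument one with a null host is a design choice of this sketch, not documented behavior.

```java
import java.util.regex.Pattern;

class CreateRuleSketch {
    // Hypothetical rule type standing in for RegexRule.
    static final class SimpleRule {
        final boolean sign;
        final Pattern pattern;
        final String hostOrDomain; // null means the rule is not tied to a host or domain

        SimpleRule(boolean sign, String regex, String hostOrDomain) {
            this.sign = sign;
            this.pattern = Pattern.compile(regex);
            this.hostOrDomain = hostOrDomain;
        }

        // true: matching URLs are included; false: matching URLs are excluded.
        boolean accept() {
            return sign;
        }

        // Uses find(), so the regex may match anywhere in the URL.
        boolean match(String url) {
            return pattern.matcher(url).find();
        }
    }

    // Two-argument overload: a rule without host or domain restriction.
    static SimpleRule createRule(boolean sign, String regex) {
        return createRule(sign, regex, null);
    }

    // Three-argument overload: a rule tied to a host or domain.
    static SimpleRule createRule(boolean sign, String regex, String hostOrDomain) {
        return new SimpleRule(sign, regex, hostOrDomain);
    }
}
```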
      • getRulesReader

        protected abstract Reader getRulesReader(Configuration conf)
                                          throws IOException
        Returns a Reader over the rules to use for a particular implementation.
        Parameters:
        conf - the current configuration.
        Returns:
        a Reader over the resource containing the rules to use.
        Throws:
        IOException - if there is a fatal error obtaining the Reader
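        One way an implementation might satisfy this contract is sketched below. To stay self-contained, a plain Map<String, String> stands in for the Configuration type, and the key "sketch.rules" is a made-up property name; a real subclass would read whatever property or resource it defines.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;

class RulesReaderSketch {
    // Hypothetical key; not a property defined by the documented class.
    static final String RULES_KEY = "sketch.rules";

    // Returns a Reader over the configured rules, or fails with an
    // IOException when no rules are available.
    static Reader getRulesReader(Map<String, String> conf) throws IOException {
        String rules = conf.get(RULES_KEY);
        if (rules == null) {
            throw new IOException("No rules configured under " + RULES_KEY);
        }
        return new StringReader(rules);
    }
}
```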
      • filter

        public String filter(String url)
        Description copied from interface: URLFilter
        Interface for a filter that transforms a URL: it can pass the original URL through or "delete" the URL by returning null
        Specified by:
        filter in interface URLFilter
        Parameters:
        url - the URL string the filter is applied on
        Returns:
        the original URL string if the URL is accepted by the filter, or null if the URL is rejected
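        The accept-or-null contract can be sketched as follows. The evaluation order is an assumption not stated on this page: rules are tried in insertion order, the first matching rule decides, and a URL that matches no rule is rejected. FirstMatchFilterSketch and add are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class FirstMatchFilterSketch {
    // Parallel lists keep the sketch short: patterns.get(i) pairs with signs.get(i).
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<Boolean> signs = new ArrayList<>();

    // Appends a rule; returns this so calls can be chained.
    FirstMatchFilterSketch add(boolean sign, String regex) {
        patterns.add(Pattern.compile(regex));
        signs.add(sign);
        return this;
    }

    // Returns the URL unchanged if accepted, or null if rejected.
    String filter(String url) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(url).find()) {
                return signs.get(i) ? url : null; // first match decides (assumed)
            }
        }
        return null; // no rule matched: reject (assumed default)
    }
}
```

        Under these assumptions, rule order matters: placing an exclude rule for image suffixes before a broad include rule rejects image URLs that the include rule would otherwise accept.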
      • main

        public static void main(RegexURLFilterBase filter,
                                String[] args)
                         throws IOException,
                                IllegalArgumentException
        Filters the standard input using a RegexURLFilterBase.
        Parameters:
        filter - the RegexURLFilterBase used to filter the standard input.
        args - optional parameters (not used).
        Throws:
        IOException - if there is a fatal I/O error interpreting the input arguments
        IllegalArgumentException - if there is a fatal error processing the input arguments
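        A line-by-line filtering loop of this kind might look like the sketch below. The output format is an assumption (accepted URLs prefixed with '+', rejected ones with '-'), and a UnaryOperator<String> stands in for the filter so the example runs without the documented class. StdinFilterSketch and run are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.function.UnaryOperator;

class StdinFilterSketch {
    // Reads one URL per line from 'in', applies the filter, and writes the
    // verdict to 'out'. The "+url" / "-url" output format is assumed.
    static void run(UnaryOperator<String> filter, Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String kept = filter.apply(line);
            out.write((kept != null ? "+" + kept : "-" + line) + "\n");
        }
        out.flush();
    }
}
```

        To mirror the documented behavior, a caller could pass new InputStreamReader(System.in) as the reader and new OutputStreamWriter(System.out) as the writer.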