Class LinksIndexingFilter

  • All Implemented Interfaces:
    Configurable, IndexingFilter, Pluggable

    public class LinksIndexingFilter
    extends Object
    implements IndexingFilter
    An IndexingFilter that adds outlinks and inlinks field(s) to the document.

    To ignore outlinks that point to the same host as the URL being indexed, use the following setting in your configuration file:

        <property>
          <name>index.links.outlinks.host.ignore</name>
          <value>true</value>
        </property>

    The same option is available for inlinks:

        <property>
          <name>index.links.inlinks.host.ignore</name>
          <value>true</value>
        </property>

    To store only the host portion of each inlink or outlink URL, add the following to your configuration file:

        <property>
          <name>index.links.hosts.only</name>
          <value>true</value>
        </property>
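    The effect of these options can be illustrated with a small standalone sketch. It is not the plugin's code: the class LinkHostFilterSketch, the helpers hostOf and selectLinks, and the flag names ignoreSameHost and hostsOnly are hypothetical stand-ins for the configuration properties described above. Skipping unparsable URLs and de-duplicating values are assumptions made for this example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: mimics the documented options, not the Nutch implementation.
class LinkHostFilterSketch {

    /** Returns the lower-cased host of a URL, or null if it cannot be determined. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    /**
     * Selects the link values to index for a page.
     * ignoreSameHost stands in for index.links.{outlinks,inlinks}.host.ignore,
     * hostsOnly stands in for index.links.hosts.only.
     */
    static List<String> selectLinks(String pageUrl, List<String> links,
                                    boolean ignoreSameHost, boolean hostsOnly) {
        String pageHost = hostOf(pageUrl);
        // Assumption: duplicates are dropped while insertion order is kept.
        Set<String> selected = new LinkedHashSet<>();
        for (String link : links) {
            String linkHost = hostOf(link);
            if (linkHost == null) {
                continue; // assumption: links without a usable host are skipped
            }
            if (ignoreSameHost && linkHost.equals(pageHost)) {
                continue; // link points back to the host of the indexed page
            }
            selected.add(hostsOnly ? linkHost : link);
        }
        return new ArrayList<>(selected);
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
            "http://example.com/about",
            "http://other.example.org/page");

        // Same-host links ignored: prints [http://other.example.org/page]
        System.out.println(selectLinks("http://example.com/", outlinks, true, false));

        // Hosts only: prints [example.com, other.example.org]
        System.out.println(selectLinks("http://example.com/", outlinks, false, true));
    }
}
```

    In this sketch the two options are independent, so both can be enabled together to index only the hosts of links that lead to other sites.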
    • Constructor Detail

      • LinksIndexingFilter

        public LinksIndexingFilter()
    • Method Detail

      • filter

        public NutchDocument filter(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Description copied from interface: IndexingFilter
        Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - document instance for collecting fields
        parse - parse data instance
        url - page url
        datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
        inlinks - page inlinks
        Returns:
        modified (or a new) document instance, or null (meaning the document should be discarded)
        Throws:
        IndexingException - if an error occurs during filtering
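
        The return contract described above (return the document to keep it, return null to discard it) can be sketched without any Nutch dependencies. Everything in the example is hypothetical: DocFilter is a simplified stand-in for an indexing filter, a Map takes the place of the document, and runFilters assumes that a chain of filters stops as soon as one of them returns null.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simplified types, not the Nutch IndexingFilter API.
class FilterContractSketch {

    /** Hypothetical stand-in for an indexing filter. */
    interface DocFilter {
        /** Returns the (possibly modified) document, or null to discard it. */
        Map<String, List<String>> filter(Map<String, List<String>> doc, String url);
    }

    /** Applies filters in order; a null result discards the document (assumed behavior). */
    static Map<String, List<String>> runFilters(List<DocFilter> filters,
                                                Map<String, List<String>> doc,
                                                String url) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // document was rejected, skip the remaining filters
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Adds a field to the document and returns it.
        DocFilter addUrl = (doc, url) -> {
            doc.computeIfAbsent("url", k -> new ArrayList<>()).add(url);
            return doc;
        };
        // Rejects PDF pages by returning null.
        DocFilter dropPdf = (doc, url) -> url.endsWith(".pdf") ? null : doc;

        List<DocFilter> chain = List.of(addUrl, dropPdf);

        // Prints {url=[http://example.com/a.html]}
        System.out.println(runFilters(chain, new HashMap<>(), "http://example.com/a.html"));
        // Prints null
        System.out.println(runFilters(chain, new HashMap<>(), "http://example.com/a.pdf"));
    }
}
```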