Class MoreIndexingFilter

  • All Implemented Interfaces:
    Configurable, IndexingFilter, Pluggable

    public class MoreIndexingFilter
    extends Object
    implements IndexingFilter
    Adds (or resets) a few metadata properties as index fields, when they are available, so that they can be used accurately within the search index. 'lastModified' is indexed to support query by date; 'contentLength' takes the content length from the HTTP header; 'type' is indexed to support query by type; and 'title' is reset when a Content-Disposition hint exists, on the reasoning that such a hint indicates the content provider wants the filename it contains to be used as the title. Making content-length searchable remains to be done.
    Author:
    John Xing
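
The title reset described above can be illustrated with a minimal, self-contained sketch. It is not the filter's actual code: the class ContentDispositionTitle and its methods are hypothetical names, and the regular expression is a simplified assumption that handles only a plain filename parameter (quoted or unquoted), not the extended filename* form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows how a Content-Disposition
// filename hint could take precedence over a parsed title.
class ContentDispositionTitle {

    // Simplified assumption: matches filename="value" or filename=value.
    private static final Pattern FILENAME = Pattern.compile(
        "\\bfilename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns the filename hint from the header value, or null if there is none. */
    static String filenameFrom(String contentDisposition) {
        if (contentDisposition == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        return name.isEmpty() ? null : name;
    }

    /** Prefers the filename hint; falls back to the title found by the parser. */
    static String resolveTitle(String parsedTitle, String contentDisposition) {
        String hint = filenameFrom(contentDisposition);
        return hint != null ? hint : parsedTitle;
    }

    public static void main(String[] args) {
        System.out.println(resolveTitle("Untitled", "attachment; filename=\"report.pdf\""));
        System.out.println(resolveTitle("Untitled", "inline"));
    }
}
```

When the header carries no filename, the parsed title is left unchanged, which matches the "if a content-disposition hint exists" condition in the description.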
    • Constructor Detail

      • MoreIndexingFilter

        public MoreIndexingFilter()
    • Method Detail

      • filter

        public NutchDocument filter(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Description copied from interface: IndexingFilter
        Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - document instance for collecting fields
        parse - parse data instance
        url - page url
        datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
        inlinks - page inlinks
        Returns:
        modified (or a new) document instance, or null (meaning the document should be discarded)
        Throws:
        IndexingException - if an error occurs during filtering
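
The return contract above (a modified document, or null to discard it) can be sketched without the Nutch classes. In the following illustration a Map<String, String> stands in for NutchDocument and a UnaryOperator stands in for IndexingFilter; both are assumptions made to keep the example self-contained, and FilterChainSketch with its methods is hypothetical. Stopping at the first null is how a caller would be expected to honor the contract, not a statement about the exact Nutch implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-in types: a Map plays the role of the document,
// a UnaryOperator plays the role of a filter.
class FilterChainSketch {

    /** Applies each filter in order; returns null as soon as one discards the document. */
    static Map<String, String> runFilters(Map<String, String> doc,
                                          List<UnaryOperator<Map<String, String>>> filters) {
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // document is excluded from indexing
            }
        }
        return current;
    }

    /** Example filter: discards documents with no content. */
    static Map<String, String> dropEmpty(Map<String, String> doc) {
        String content = doc.get("content");
        return (content == null || content.isEmpty()) ? null : doc;
    }

    /** Example filter: adds a 'type' field (the value is illustrative). */
    static Map<String, String> addType(Map<String, String> doc) {
        doc.put("type", "text/html");
        return doc;
    }

    static List<UnaryOperator<Map<String, String>>> defaultFilters() {
        List<UnaryOperator<Map<String, String>>> filters = new ArrayList<>();
        filters.add(FilterChainSketch::dropEmpty);
        filters.add(FilterChainSketch::addType);
        return filters;
    }

    static Map<String, String> newDoc(String content) {
        Map<String, String> doc = new HashMap<>();
        doc.put("content", content);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(runFilters(newDoc("hello"), defaultFilters()));
        System.out.println(runFilters(newDoc(""), defaultFilters()));
    }
}
```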