Class LanguageIndexingFilter

  • All Implemented Interfaces:
    Configurable, IndexingFilter, Pluggable

    public class LanguageIndexingFilter
    extends Object
    implements IndexingFilter
    An IndexingFilter that adds a lang (language) field to the document. It tries to determine the language of the document by:
    • First, checking whether HTMLLanguageParser added any language information
    • Then, checking whether a Content-Language HTTP header can be found
    • Finally, analyzing the document content
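The lookup order above can be read as a simple fallback chain. The following is a minimal sketch of that idea only, not the filter's actual code: the class name `LanguageFallbackSketch`, the method `firstLanguage`, and the "unknown" default are all assumptions made for illustration, and the content analysis step is represented by a caller-supplied function.

```java
import java.util.function.Supplier;

public class LanguageFallbackSketch {

    /**
     * Returns the first usable language code, trying the sources in the
     * documented order: parser metadata, HTTP header, content analysis.
     * The "unknown" default is an assumption of this sketch.
     */
    public static String firstLanguage(String fromParser,
                                       String fromHeader,
                                       Supplier<String> contentDetector) {
        if (isUsable(fromParser)) {
            return fromParser.trim();
        }
        if (isUsable(fromHeader)) {
            return fromHeader.trim();
        }
        // Content analysis is the last resort, so it is only invoked
        // when both cheaper sources came back empty.
        String detected = contentDetector.get();
        return isUsable(detected) ? detected.trim() : "unknown";
    }

    // A value counts as usable if it is non-null and not blank.
    private static boolean isUsable(String value) {
        return value != null && !value.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(firstLanguage("fr", "en", () -> "de")); // fr
        System.out.println(firstLanguage(null, "en", () -> "de")); // en
        System.out.println(firstLanguage(null, " ", () -> "de"));  // de
        System.out.println(firstLanguage(null, null, () -> null)); // unknown
    }
}
```

Passing the detector as a `Supplier` keeps the expensive step lazy; that is a design choice of this sketch, not something the documentation above states.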
    Author:
    Sami Siren, Jerome Charron
    • Constructor Detail

      • LanguageIndexingFilter

        public LanguageIndexingFilter()
        Constructs a new Language Indexing Filter.
    • Method Detail

      • filter

        public NutchDocument filter​(NutchDocument doc,
                                    Parse parse,
                                    Text url,
                                    CrawlDatum datum,
                                    Inlinks inlinks)
                             throws IndexingException
        Description copied from interface: IndexingFilter
        Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
        Specified by:
        filter in interface IndexingFilter
        Parameters:
        doc - document instance for collecting fields
        parse - parse data instance
        url - page url
        datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
        inlinks - page inlinks
        Returns:
        modified (or a new) document instance, or null (meaning the document should be discarded)
        Throws:
        IndexingException - if an error occurs during filtering
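The return contract described above (a modified document, or null to discard it) can be illustrated without any Nutch classes. The sketch below is hypothetical throughout: a `Map<String, String>` stands in for NutchDocument, `UnaryOperator` stands in for the IndexingFilter interface, and `runFilters`, `addLang`, `requireUrl`, and `indexedLang` are invented names. It assumes a caller that stops at the first null result, which the text above does not spell out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class FilterContractSketch {

    // Hypothetical filter: adds a "lang" field and returns the same document.
    static UnaryOperator<Map<String, String>> addLang(String lang) {
        return doc -> {
            doc.put("lang", lang);
            return doc;
        };
    }

    // Hypothetical filter: discards documents without a URL by returning null.
    static UnaryOperator<Map<String, String>> requireUrl() {
        return doc -> {
            String url = doc.get("url");
            return (url == null || url.isEmpty()) ? null : doc;
        };
    }

    // Applies the filters in order and stops at the first null result.
    static Map<String, String> runFilters(Map<String, String> doc,
                                          List<UnaryOperator<Map<String, String>>> filters) {
        Map<String, String> current = doc;
        for (UnaryOperator<Map<String, String>> filter : filters) {
            current = filter.apply(current);
            if (current == null) {
                return null; // discarded: later filters are skipped
            }
        }
        return current;
    }

    // Convenience wrapper: returns the "lang" field, or null if discarded.
    public static String indexedLang(String url, String lang) {
        Map<String, String> doc = new HashMap<>();
        doc.put("url", url);
        Map<String, String> result =
                runFilters(doc, Arrays.asList(requireUrl(), addLang(lang)));
        return result == null ? null : result.get("lang");
    }

    public static void main(String[] args) {
        System.out.println(indexedLang("http://example.org/", "en")); // en
        System.out.println(indexedLang("", "en"));                    // null
    }
}
```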