Class ScoringFilters

    • Constructor Detail

    • Method Detail

      • generatorSortValue

        public float generatorSortValue(Text url,
                                        CrawlDatum datum,
                                        float initSort)
                                 throws ScoringFilterException
        Calculate a sort value for Generate.
        Specified by:
        generatorSortValue in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - page's datum, should not be modified
        initSort - initial sort value, or a value from previous filters in chain
        Returns:
        a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
        Throws:
        ScoringFilterException - if there is a fatal error preparing the sort value
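        As the initSort description suggests, each filter in a chain receives the value produced by the one before it. The following is a minimal sketch of that hand-off under simplified assumptions: SortFilter and SortChainSketch are hypothetical stand-ins (not Nutch types), the url is a plain String, the datum is reduced to a single float score, and ScoringFilterException is omitted.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a single filter's generatorSortValue; not a Nutch type.
interface SortFilter {
    float sortValue(String url, float datumScore, float initSort);
}

class SortChainSketch {
    // Pass the value returned by each filter to the next filter as its initSort.
    static float run(List<SortFilter> filters, String url, float datumScore, float initSort) {
        float sort = initSort;
        for (SortFilter filter : filters) {
            sort = filter.sortValue(url, datumScore, sort);
        }
        return sort;
    }

    public static void main(String[] args) {
        // Two illustrative filters: one scales by the datum score, one halves the result.
        SortFilter byScore = (url, score, init) -> init * score;
        SortFilter halve = (url, score, init) -> init * 0.5f;
        float result = run(Arrays.asList(byScore, halve), "http://example.com/", 4.0f, 1.0f);
        System.out.println(result); // prints 2.0
    }
}
```

        With an empty filter list the sketch returns initSort unchanged, which is the natural identity behaviour for a chain of this shape.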
      • initialScore

        public void initialScore(Text url,
                                 CrawlDatum datum)
                          throws ScoringFilterException
        Calculate a new initial score, used when adding newly discovered pages.
        Specified by:
        initialScore in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - new datum. Filters will modify it in-place.
        Throws:
        ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
      • injectedScore

        public void injectedScore(Text url,
                                  CrawlDatum datum)
                           throws ScoringFilterException
        Calculate a new initial score, used when injecting new pages.
        Specified by:
        injectedScore in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - new datum. Filters will modify it in-place.
        Throws:
        ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
      • updateDbScore

        public void updateDbScore(Text url,
                                  CrawlDatum old,
                                  CrawlDatum datum,
                                  List<CrawlDatum> inlinked)
                           throws ScoringFilterException
        Calculate updated page score during CrawlDb.update().
        Specified by:
        updateDbScore in interface ScoringFilter
        Parameters:
        url - url of the page
        old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values, because the datum parameter may contain values that are no longer valid if other updates occurred between generation and this update.
        datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
        inlinked - (partial) list of CrawlDatum instances (with their scores) from links pointing to this page, found in the current update batch.
        Throws:
        ScoringFilterException - if there is a fatal error calculating a new score of CrawlDatum during CrawlDb update
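        The roles of old, datum and inlinked can be illustrated with a small sketch. It is one possible policy, not the behaviour of any particular Nutch plugin: UpdateDatum and UpdateDbSketch are hypothetical stand-ins (not Nutch types), and a datum is reduced to a single float score.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for CrawlDatum, reduced to a score; not a Nutch type.
class UpdateDatum {
    float score;

    UpdateDatum(float score) {
        this.score = score;
    }
}

class UpdateDbSketch {
    // Illustrative policy only: start from old's score when old is present
    // (datum's score may be stale), otherwise from datum's own score, then
    // add the scores carried by the inlinked entries. The result is written
    // to datum in place, since datum is what gets saved.
    static void updateDbScore(UpdateDatum old, UpdateDatum datum, List<UpdateDatum> inlinked) {
        float base = (old != null) ? old.score : datum.score;
        float contributed = 0.0f;
        for (UpdateDatum in : inlinked) {
            contributed += in.score;
        }
        datum.score = base + contributed;
    }

    public static void main(String[] args) {
        UpdateDatum old = new UpdateDatum(1.0f);
        UpdateDatum datum = new UpdateDatum(5.0f); // stale value, ignored because old is present
        updateDbScore(old, datum, Arrays.asList(new UpdateDatum(0.5f), new UpdateDatum(0.25f)));
        System.out.println(datum.score); // prints 1.75

        UpdateDatum fresh = new UpdateDatum(2.0f);
        updateDbScore(null, fresh, Collections.<UpdateDatum>emptyList());
        System.out.println(fresh.score); // prints 2.0
    }
}
```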
      • passScoreBeforeParsing

        public void passScoreBeforeParsing(Text url,
                                           CrawlDatum datum,
                                           Content content)
                                    throws ScoringFilterException
        Description copied from interface: ScoringFilter
        This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it in the Content metadata. This is needed in order to pass these values to the mechanism that distributes them to outlinked pages.
        Specified by:
        passScoreBeforeParsing in interface ScoringFilter
        Parameters:
        url - url of the page
        datum - source datum. NOTE: modifications to this value are not persisted.
        content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
        Throws:
        ScoringFilterException - if there is a fatal error injecting score information from the current datum into Content metadata
      • passScoreAfterParsing

        public void passScoreAfterParsing(Text url,
                                          Content content,
                                          Parse parse)
                                   throws ScoringFilterException
        Description copied from interface: ScoringFilter
        Part of score distribution is currently performed using only data coming from the parsing process. This method copies score information from the content to the parse result so that score data is present in those steps.
        Specified by:
        passScoreAfterParsing in interface ScoringFilter
        Parameters:
        url - page url
        content - original content. NOTE: modifications to this value are not persisted.
        parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
        Throws:
        ScoringFilterException - if there is a fatal error processing score data in subsequent steps after parsing
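        The hand-off performed by passScoreBeforeParsing and passScoreAfterParsing can be pictured as copying a score through two metadata containers. The sketch below assumes plain string maps in place of the Content and Parse metadata; PassScoreSketch and the key "sketch.score" are hypothetical names chosen for illustration, not Nutch identifiers.

```java
import java.util.HashMap;
import java.util.Map;

class PassScoreSketch {
    // Hypothetical metadata key used only in this sketch.
    static final String SCORE_KEY = "sketch.score";

    // Before parsing: store the datum's score in the content metadata.
    static void beforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the stored score from content metadata to parse metadata.
    // The content metadata is only read here, never modified.
    static void afterParsing(Map<String, String> contentMeta, Map<String, String> parseMeta) {
        String stored = contentMeta.get(SCORE_KEY);
        if (stored != null) {
            parseMeta.put(SCORE_KEY, stored);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        beforeParsing(0.75f, contentMeta);
        afterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta.get(SCORE_KEY)); // prints 0.75
    }
}
```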
      • distributeScoreToOutlinks

        public CrawlDatum distributeScoreToOutlinks(Text fromUrl,
                                                    ParseData parseData,
                                                    Collection<Map.Entry<Text,CrawlDatum>> targets,
                                                    CrawlDatum adjust,
                                                    int allCount)
                                             throws ScoringFilterException
        Description copied from interface: ScoringFilter
        Distribute score value from the current page to all its outlinked pages.
        Specified by:
        distributeScoreToOutlinks in interface ScoringFilter
        Parameters:
        fromUrl - url of the source page
        parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
        targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
        adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
        allCount - number of all collected outlinks from the source page
        Returns:
        a CrawlDatum instance with status CrawlDatum.STATUS_LINKED that contains adjustments to be applied to the original CrawlDatum score(s) and metadata, or null if no adjustment is needed
        Throws:
        ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
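        A minimal sketch of one way to use targets and allCount together, assuming an even split of the source score. OutlinkDatum and DistributeSketch are hypothetical stand-ins (not Nutch types), urls are plain Strings, and the source score is passed directly rather than read from ParseData metadata.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum, reduced to a score; not a Nutch type.
class OutlinkDatum {
    float score;
}

class DistributeSketch {
    // Illustrative policy only: give every target an equal share of the source
    // score, dividing by allCount (all collected outlinks) rather than by the
    // number of entries in targets. Targets are modified in place.
    static OutlinkDatum distribute(float sourceScore,
                                   Collection<Map.Entry<String, OutlinkDatum>> targets,
                                   int allCount) {
        if (allCount <= 0) {
            return null; // nothing to distribute
        }
        float share = sourceScore / allCount;
        for (Map.Entry<String, OutlinkDatum> target : targets) {
            target.getValue().score = share;
        }
        return null; // this sketch makes no adjustment to the source datum
    }

    public static void main(String[] args) {
        List<Map.Entry<String, OutlinkDatum>> targets = new ArrayList<>();
        targets.add(new AbstractMap.SimpleEntry<>("http://example.com/a", new OutlinkDatum()));
        targets.add(new AbstractMap.SimpleEntry<>("http://example.com/b", new OutlinkDatum()));
        distribute(1.0f, targets, 4);
        for (Map.Entry<String, OutlinkDatum> target : targets) {
            System.out.println(target.getKey() + " " + target.getValue().score);
        }
    }
}
```

        Dividing by allCount instead of the size of targets means the shares do not grow when targets holds fewer entries than the page has outlinks; whether that is the desired behaviour is a design choice for the implementation.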
      • indexerScore

        public float indexerScore(Text url,
                                  NutchDocument doc,
                                  CrawlDatum dbDatum,
                                  CrawlDatum fetchDatum,
                                  Parse parse,
                                  Inlinks inlinks,
                                  float initScore)
                           throws ScoringFilterException
        Description copied from interface: ScoringFilter
        This method calculates an indexed document score/boost.
        Specified by:
        indexerScore in interface ScoringFilter
        Parameters:
        url - url of the page
        doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
        dbDatum - current page from CrawlDb. NOTE:
        • changes made to this instance are not persisted
        • may be null if indexing is done without CrawlDb or if the segment was not generated from the CrawlDb (for example, via FreeGenerator).
        fetchDatum - datum from FetcherOutput (containing, among other things, the fetch status)
        parse - parsing result. NOTE: changes made to this instance are not persisted.
        inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
        initScore - initial boost value for the indexed document.
        Returns:
        boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
        Throws:
        ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
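        Because dbDatum may be null, an implementation needs a fallback for that case. The sketch below shows one hedged possibility: scale initScore by the stored score when a datum is available, and pass initScore through unchanged otherwise. IndexDatum and IndexerScoreSketch are hypothetical stand-ins (not Nutch types), and the remaining parameters of the real method are left out.

```java
// Hypothetical stand-in for CrawlDatum, reduced to a score; not a Nutch type.
class IndexDatum {
    float score;

    IndexDatum(float score) {
        this.score = score;
    }
}

class IndexerScoreSketch {
    // Illustrative policy only: multiply the incoming boost by the stored score,
    // or return the incoming boost as-is when no CrawlDb datum is available.
    static float indexerScore(IndexDatum dbDatum, float initScore) {
        if (dbDatum == null) {
            return initScore;
        }
        return initScore * dbDatum.score;
    }

    public static void main(String[] args) {
        System.out.println(indexerScore(new IndexDatum(2.0f), 1.5f)); // prints 3.0
        System.out.println(indexerScore(null, 1.5f));                 // prints 1.5
    }
}
```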