Class BM25Similarity


  • public class BM25Similarity
    extends Similarity
    BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
    • Constructor Detail

      • BM25Similarity

        public BM25Similarity​(float k1,
                              float b,
                              boolean discountOverlaps)
        BM25 with the supplied parameter values.
        Parameters:
        k1 - Controls non-linear term frequency normalization (saturation).
        b - Controls to what degree document length normalizes tf values.
        discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
        Throws:
        IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
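The documented argument contract can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the library's source: the class `Bm25ParamCheck` and its `check` method are hypothetical names, and rejecting NaN for either parameter is an assumption that goes beyond what the text above states.

```java
// Hypothetical helper that mirrors the documented constructor contract.
final class Bm25ParamCheck {

    // Throws IllegalArgumentException if k1 is infinite or negative,
    // or if b lies outside [0, 1]. Rejecting NaN is an assumption.
    static void check(float k1, float b) {
        if (Float.isInfinite(k1) || Float.isNaN(k1) || k1 < 0) {
            throw new IllegalArgumentException(
                "illegal k1 value: " + k1 + ", must be a non-negative finite value");
        }
        if (Float.isNaN(b) || b < 0 || b > 1) {
            throw new IllegalArgumentException(
                "illegal b value: " + b + ", must be between 0 and 1");
        }
    }
}
```

Under this sketch, `check(1.2f, 0.75f)` returns normally, while `check(-1f, 0.5f)` and `check(1.2f, 1.5f)` both throw.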
      • BM25Similarity

        public BM25Similarity​(float k1,
                              float b)
        BM25 with the supplied parameter values.
        Parameters:
        k1 - Controls non-linear term frequency normalization (saturation).
        b - Controls to what degree document length normalizes tf values.
        Throws:
        IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
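To make the roles of k1 and b concrete, here is a hedged sketch of the textbook BM25 term-frequency component. The class and method names are hypothetical, and the library's actual scoring code may differ from this form (for example by constant factors).

```java
// Textbook BM25 term-frequency saturation; names are illustrative only.
final class Bm25TfSketch {

    // freq:      term frequency in the document
    // k1:        saturation parameter (larger k1 = slower saturation)
    // b:         length normalization strength in [0, 1]
    // docLen:    length of this document's field
    // avgDocLen: average field length across the collection
    static double tfNorm(double freq, double k1, double b,
                         double docLen, double avgDocLen) {
        // With b = 0 the length ratio is ignored; with b = 1 it applies fully.
        double lengthNorm = 1 - b + b * (docLen / avgDocLen);
        // The result grows with freq but stays below 1 when k1 > 0.
        return freq / (freq + k1 * lengthNorm);
    }
}
```

In this form, setting b to 0 makes documents of any length score the same for a given frequency, and setting k1 to 0 reduces the component to a constant for any positive frequency.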
      • BM25Similarity

        public BM25Similarity​(boolean discountOverlaps)
        BM25 with these default values:
        • k1 = 1.2
        • b = 0.75
        and the supplied parameter value:
        Parameters:
        discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
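The effect of discountOverlaps on a document's length can be illustrated with a small standalone sketch. It assumes a field is represented only by the position increments of its tokens; the class `FieldLengthSketch` and its method are hypothetical and are not part of the library.

```java
// Illustrative only: counts tokens, optionally skipping overlap tokens.
final class FieldLengthSketch {

    // positionIncrements: one entry per token; 0 marks an overlap token
    // (for example a synonym stacked on the previous token's position).
    static int fieldLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            boolean overlap = increment == 0;
            if (discountOverlaps && overlap) {
                continue; // overlap tokens do not add to the length
            }
            length++;
        }
        return length;
    }
}
```

For the increments `{1, 1, 0, 1}`, this sketch yields a length of 3 with discounting enabled and 4 with it disabled.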
      • BM25Similarity

        public BM25Similarity()
        BM25 with these default values:
        • k1 = 1.2
        • b = 0.75
        • discountOverlaps = true
    • Method Detail

      • idf

        protected float idf​(long docFreq,
                            long docCount)
        Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
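The formula above can be written out directly in plain Java. This is a minimal sketch of the stated expression, assuming a natural logarithm; the class name `Bm25IdfSketch` is hypothetical and the code is not copied from the library.

```java
// Standalone rendering of the documented idf formula.
final class Bm25IdfSketch {

    // docFreq:  number of documents containing the term
    // docCount: number of documents that have the field
    static float idf(long docFreq, long docCount) {
        // The 0.5 terms smooth the ratio; the leading "1 +" keeps the
        // logarithm's argument above 1 whenever docFreq <= docCount.
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }
}
```

Rarer terms receive larger values: in this sketch `idf(1, 100)` is greater than `idf(50, 100)`, and the result stays positive even when every document contains the term.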
      • avgFieldLength

        protected float avgFieldLength​(CollectionStatistics collectionStats)
        The default implementation computes the average as sumTotalTermFreq / docCount.
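As a hedged illustration of that average, the sketch below takes the two statistics as plain `long` values rather than a `CollectionStatistics` object. The class name is hypothetical, and the widening to `double` before dividing is an assumption made here so that the quotient is not truncated by integer division.

```java
// Illustrative average field length from two collection-level counts.
final class AvgFieldLengthSketch {

    // sumTotalTermFreq: total number of tokens in the field, over all documents
    // docCount:         number of documents that have the field
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        // Divide in double precision, then narrow to float.
        return (float) (sumTotalTermFreq / (double) docCount);
    }
}
```

With 10 tokens spread over 4 documents, this gives 2.5 rather than the 2 that integer division would produce.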
      • idfExplain

        public Explanation idfExplain​(CollectionStatistics collectionStats,
                                      TermStatistics[] termStats)
        Computes a score factor for a phrase.

        The default implementation sums the idf factor for each term in the phrase.

        Parameters:
        collectionStats - collection-level statistics
        termStats - term-level statistics for the terms in the phrase
        Returns:
        an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
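The summation described for phrases can be sketched without any library types. The example below assumes each term is represented only by its document frequency; `PhraseIdfSketch` and its methods are hypothetical names, and the per-term formula is the one documented for idf.

```java
// Illustrative phrase idf: the sum of the per-term idf values.
final class PhraseIdfSketch {

    // Per-term idf, following the documented formula.
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    // docFreqs: document frequency of each term in the phrase
    // docCount: number of documents that have the field
    static float phraseIdf(long[] docFreqs, long docCount) {
        double sum = 0;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount); // each term contributes its own idf
        }
        return (float) sum;
    }
}
```

A phrase made of two rare terms therefore receives a larger factor than a phrase made of two common ones.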
      • scorer

        public final Similarity.SimScorer scorer​(float boost,
                                                 CollectionStatistics collectionStats,
                                                 TermStatistics... termStats)
        Description copied from class: Similarity
        Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
        Specified by:
        scorer in class Similarity
        Parameters:
        boost - a multiplicative factor to apply to the produced scores
        collectionStats - collection-level statistics, such as the number of tokens in the collection.
        termStats - term-level statistics, such as the document frequency of a term across the collection.
        Returns:
        SimScorer object with the information this Similarity needs to score a query.
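The idea of computing collection-level weights once and reusing them per document can be sketched as follows. This is a simplified single-term illustration in textbook BM25 form, not the library's `SimScorer`: the class `Bm25WeightSketch`, its constructor arguments, and its `score` method are all hypothetical, and the real scorer may differ in details such as how document length is encoded.

```java
// Illustrative scorer: collection-level values are fixed at construction.
final class Bm25WeightSketch {
    private final float weight;  // boost * idf, computed once per query term
    private final float avgdl;   // average field length, computed once
    private final float k1;
    private final float b;

    Bm25WeightSketch(float boost, long docFreq, long docCount,
                     long sumTotalTermFreq, float k1, float b) {
        float idf = (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
        this.weight = boost * idf;
        this.avgdl = (float) (sumTotalTermFreq / (double) docCount);
        this.k1 = k1;
        this.b = b;
    }

    // Per-document work only needs the term frequency and the field length.
    float score(float freq, long docLen) {
        double norm = k1 * (1 - b + b * docLen / avgdl);
        return (float) (weight * freq / (freq + norm));
    }
}
```

Because boost is folded into the precomputed weight, doubling it doubles every score produced by the same instance parameters.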