Class BM25Similarity
- java.lang.Object
-
- org.apache.lucene.search.similarities.Similarity
-
- org.apache.lucene.search.similarities.BM25Similarity
-
public class BM25Similarity extends Similarity
BM25 Similarity. Introduced in Stephen E. Robertson, Steve Walker, Susan Jones, Micheline Hancock-Beaulieu, and Mike Gatford. Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC 1994). Gaithersburg, USA, November 1994.
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer
-
-
Constructor Summary
Constructors
BM25Similarity()
BM25 with these default values: k1 = 1.2, b = 0.75, discountOverlaps = true.
BM25Similarity(boolean discountOverlaps)
BM25 with these default values: k1 = 1.2, b = 0.75, and the supplied parameter value.
BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
BM25Similarity(float k1, float b, boolean discountOverlaps)
BM25 with the supplied parameter values.
-
Method Summary
Modifier and Type, Method, Description
protected float avgFieldLength(CollectionStatistics collectionStats)
The default implementation computes the average as sumTotalTermFreq / docCount.
float getB()
Returns the b parameter.
float getK1()
Returns the k1 parameter.
protected float idf(long docFreq, long docCount)
Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
Computes a score factor for a phrase.
Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
String toString()
-
Methods inherited from class org.apache.lucene.search.similarities.Similarity
computeNorm, getDiscountOverlaps
-
-
-
-
Constructor Detail
-
BM25Similarity
public BM25Similarity(float k1, float b, boolean discountOverlaps)
BM25 with the supplied parameter values.
- Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
- Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
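The documented argument checks can be illustrated with a minimal sketch. The class `Bm25ParamsSketch` and its `isValid` helper are hypothetical names introduced here; this is not the Lucene source, and rejecting NaN for `k1` is an assumption about how "infinite" is interpreted.

```java
// Hypothetical stand-in that mirrors the documented constructor contract.
// It is not the Lucene class, and the exact checks are an assumption.
final class Bm25ParamsSketch {
    final float k1;
    final float b;

    Bm25ParamsSketch(float k1, float b) {
        // k1 must be finite and non-negative; Float.isFinite also rejects NaN.
        if (!Float.isFinite(k1) || k1 < 0) {
            throw new IllegalArgumentException("illegal k1 value: " + k1);
        }
        // b must lie in [0..1]; NaN is checked explicitly because every
        // comparison against NaN is false.
        if (Float.isNaN(b) || b < 0 || b > 1) {
            throw new IllegalArgumentException("illegal b value: " + b);
        }
        this.k1 = k1;
        this.b = b;
    }

    // Returns true when the pair would be accepted by the checks above.
    static boolean isValid(float k1, float b) {
        try {
            new Bm25ParamsSketch(k1, b);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(1.2f, 0.75f));                    // true
        System.out.println(isValid(-0.1f, 0.75f));                   // false
        System.out.println(isValid(Float.POSITIVE_INFINITY, 0.5f));  // false
        System.out.println(isValid(1.2f, 1.5f));                     // false
    }
}
```

Both boundary values of b (0 and 1) are accepted in this sketch, as is k1 = 0.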
-
BM25Similarity
public BM25Similarity(float k1, float b)
BM25 with the supplied parameter values.
- Parameters:
k1 - Controls non-linear term frequency normalization (saturation).
b - Controls to what degree document length normalizes tf values.
- Throws:
IllegalArgumentException - if k1 is infinite or negative, or if b is not within the range [0..1]
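To make the roles of k1 and b concrete, here is a minimal sketch of the textbook BM25 term-frequency factor, tf / (tf + k1 * (1 - b + b * docLen / avgDocLen)). The class and method names are hypothetical, and the formula is the standard one from the literature rather than a copy of Lucene's internal scorer, which may differ in details such as how document length is stored.

```java
// Textbook BM25 term-frequency factor; names here are illustrative only.
final class Bm25TfSketch {
    // tf: term frequency in the document
    // docLen / avgDocLen: document length relative to the field's average
    // k1: saturation control, b: strength of length normalization
    static double tfFactor(double tf, double docLen, double avgDocLen, double k1, double b) {
        double lengthNorm = 1 - b + b * (docLen / avgDocLen);
        return tf / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // The factor grows with tf but stays below 1 (saturation).
        for (int tf = 1; tf <= 16; tf *= 2) {
            System.out.println("tf=" + tf + " -> " + tfFactor(tf, 100, 100, 1.2, 0.75));
        }
        // With b > 0, a longer-than-average document scores lower for the same tf.
        System.out.println(tfFactor(3, 400, 100, 1.2, 0.75) < tfFactor(3, 100, 100, 1.2, 0.75)); // true
    }
}
```

In this form, b = 0 removes the document-length effect entirely, and larger k1 values make the factor saturate more slowly as tf grows.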
-
BM25Similarity
public BM25Similarity(boolean discountOverlaps)
BM25 with these default values: k1 = 1.2, b = 0.75, and the supplied parameter value.
- Parameters:
discountOverlaps - True if overlap tokens (tokens with a position increment of zero) are discounted from the document's length.
-
BM25Similarity
public BM25Similarity()
BM25 with these default values: k1 = 1.2, b = 0.75, discountOverlaps = true.
-
-
Method Detail
-
idf
protected float idf(long docFreq, long docCount)
Implemented as log(1 + (docCount - docFreq + 0.5)/(docFreq + 0.5)).
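The formula above can be written out as a small standalone sketch. `Bm25IdfSketch` is a hypothetical class name, and the use of `Math.log` (natural logarithm) is an assumption about what "log" denotes here.

```java
// Standalone sketch of the documented idf formula; not the Lucene class.
final class Bm25IdfSketch {
    // docFreq: number of documents containing the term
    // docCount: number of documents that have at least one term for the field
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        // Rare terms get a larger idf than common ones.
        System.out.println("docFreq=1  of 10: " + idf(1, 10));
        System.out.println("docFreq=5  of 10: " + idf(5, 10));
        System.out.println("docFreq=10 of 10: " + idf(10, 10));
    }
}
```

Because of the `1 +` inside the logarithm, the result stays positive even when a term occurs in every document.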
-
avgFieldLength
protected float avgFieldLength(CollectionStatistics collectionStats)
The default implementation computes the average as sumTotalTermFreq / docCount.
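As a rough illustration of that ratio, the sketch below takes the two statistics as plain `long` values instead of a `CollectionStatistics` object. The class name is hypothetical, and the cast to `double` before dividing is an assumption made to avoid integer truncation.

```java
// Sketch of the average field length computation on raw statistics.
final class AvgFieldLengthSketch {
    // sumTotalTermFreq: total number of tokens in the field across the index
    // docCount: number of documents that have at least one term for the field
    static float avgFieldLength(long sumTotalTermFreq, long docCount) {
        // Divide in double precision so that 7 / 2 yields 3.5, not 3.
        return (float) (sumTotalTermFreq / (double) docCount);
    }

    public static void main(String[] args) {
        System.out.println(avgFieldLength(1000, 8)); // 125.0
        System.out.println(avgFieldLength(7, 2));    // 3.5
    }
}
```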
-
idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics termStats)
Computes a score factor for a simple term and returns an explanation for that score factor.
The default implementation uses:
idf(docFreq, docCount);
Note that CollectionStatistics.docCount() is used instead of IndexReader#numDocs() because TermStatistics.docFreq() is also used, and when the latter is inaccurate, so is CollectionStatistics.docCount(), and in the same direction. In addition, CollectionStatistics.docCount() does not skew when fields are sparse.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the term
- Returns:
- an Explanation object that includes both an idf score factor and an explanation for the term.
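The point about sparse fields can be seen with a small numeric sketch. The numbers are made up for illustration: an index of 1000 documents in which only 100 have the field, and a term that occurs in 50 of those. The class name and the local `idf` helper are hypothetical and simply restate the documented formula.

```java
// Illustrates why the per-field docCount is a better denominator than numDocs
// when a field is sparse. All figures are invented for the example.
final class SparseFieldIdfSketch {
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    public static void main(String[] args) {
        long numDocs = 1000;  // all documents in the index
        long docCount = 100;  // documents that actually have the field
        long docFreq = 50;    // documents containing the term in that field

        // The term occurs in half of the documents that have the field,
        // so it is not very discriminative there.
        System.out.println("idf using docCount: " + idf(docFreq, docCount));
        // Using numDocs makes the same term look much rarer than it is.
        System.out.println("idf using numDocs:  " + idf(docFreq, numDocs));
    }
}
```

With these figures the first value is log 2 (about 0.69) and the second is about 2.99, so counting documents without the field would inflate the term's weight.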
-
idfExplain
public Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics[] termStats)
Computes a score factor for a phrase.
The default implementation sums the idf factor for each term in the phrase.
- Parameters:
collectionStats - collection-level statistics
termStats - term-level statistics for the terms in the phrase
- Returns:
- an Explanation object that includes both an idf score factor for the phrase and an explanation for each term.
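The summation described above can be sketched on raw document frequencies. `PhraseIdfSketch` and `phraseIdf` are hypothetical names, and the array of `long` values stands in for the `TermStatistics[]` argument.

```java
// Sketch: the phrase-level idf is the sum of the per-term idf values.
final class PhraseIdfSketch {
    static float idf(long docFreq, long docCount) {
        return (float) Math.log(1 + (docCount - docFreq + 0.5D) / (docFreq + 0.5D));
    }

    // docFreqs: one document frequency per term in the phrase
    static float phraseIdf(long[] docFreqs, long docCount) {
        double sum = 0d;
        for (long docFreq : docFreqs) {
            sum += idf(docFreq, docCount);
        }
        return (float) sum;
    }

    public static void main(String[] args) {
        long[] docFreqs = {1, 5}; // a rare term followed by a common one
        System.out.println("phrase idf: " + phraseIdf(docFreqs, 10));
    }
}
```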
-
scorer
public final Similarity.SimScorer scorer(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
Description copied from class: Similarity
Compute any collection-level weight (e.g. IDF, average document length, etc) needed for scoring a query.
- Specified by:
scorer in class Similarity
- Parameters:
boost - a multiplicative factor to apply to the produced scores
collectionStats - collection-level statistics, such as the number of tokens in the collection.
termStats - term-level statistics, such as the document frequency of a term across the collection.
- Returns:
- a SimScorer object with the information this Similarity needs to score a query.
-
getK1
public final float getK1()
Returns the k1 parameter.
- See Also:
BM25Similarity(float, float)
-
getB
public final float getB()
Returns the b parameter.
- See Also:
BM25Similarity(float, float)
-
-