Class DFISimilarity
- java.lang.Object
- org.apache.lucene.search.similarities.Similarity
- org.apache.lucene.search.similarities.SimilarityBase
- org.apache.lucene.search.similarities.DFISimilarity
public class DFISimilarity extends SimilarityBase
Implements the Divergence from Independence (DFI) model based on Chi-square statistics (i.e., standardized Chi-squared distance from independence in term frequency tf).
DFI is both parameter-free and non-parametric:
- parameter-free: it does not require any parameter tuning or training.
- non-parametric: it does not make any assumptions about word frequency distributions on document collections.
It is highly recommended not to remove stopwords (very common terms such as the, of, and, to, a, in, for, is, on, that) when using this similarity.
For more information, see: "A nonparametric term weighting method for information retrieval based on measuring the divergence from independence".
- See Also: IndependenceStandardized, IndependenceSaturated, IndependenceChiSquared
- WARNING: This API is experimental and might change in incompatible ways in the next release.
Nested Class Summary
-
Nested classes/interfaces inherited from class org.apache.lucene.search.similarities.Similarity
Similarity.SimScorer
Constructor Summary
Constructors
DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure, using the default discountOverlaps value.
DFISimilarity(Independence independenceMeasure, boolean discountOverlaps)
Create DFI with the specified parameters.
-
Method Summary
protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
Explains the score.
Independence getIndependence()
Returns the measure of independence.
protected double score(BasicStats stats, double freq, double docLen)
Scores the document doc.
String toString()
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
-
Methods inherited from class org.apache.lucene.search.similarities.SimilarityBase
explain, fillBasicStats, log2, newStats, scorer
-
Methods inherited from class org.apache.lucene.search.similarities.Similarity
computeNorm, getDiscountOverlaps
Constructor Detail
-
DFISimilarity
public DFISimilarity(Independence independenceMeasure)
Create DFI with the specified divergence from independence measure, using the default discountOverlaps value.
- Parameters:
independenceMeasure - measure of divergence from independence
-
DFISimilarity
public DFISimilarity(Independence independenceMeasure, boolean discountOverlaps)
Create DFI with the specified parameters.
- Parameters:
independenceMeasure - measure of divergence from independence
discountOverlaps - true if overlap tokens should not impact document length for scoring.
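To illustrate what discountOverlaps changes, the following is a minimal standalone sketch, not Lucene code. It assumes that an "overlap token" is a token with a position increment of 0 (for example, an injected synonym) and that discounting simply leaves such tokens out of the document length. The class DiscountOverlapsSketch and the method documentLength are hypothetical names introduced only for this example.

```java
public class DiscountOverlapsSketch {

    /**
     * Hypothetical helper: counts the tokens that contribute to document length.
     * A token with position increment 0 is treated as an overlap token (assumption).
     */
    static int documentLength(int[] positionIncrements, boolean discountOverlaps) {
        int length = 0;
        for (int increment : positionIncrements) {
            boolean isOverlap = increment == 0;
            // Skip overlap tokens only when discounting is enabled.
            if (!(discountOverlaps && isOverlap)) {
                length++;
            }
        }
        return length;
    }

    public static void main(String[] args) {
        // "quick" (increment 1), synonym "fast" stacked on it (increment 0), "fox" (increment 1)
        int[] increments = {1, 0, 1};
        System.out.println("discountOverlaps=true:  " + documentLength(increments, true));
        System.out.println("discountOverlaps=false: " + documentLength(increments, false));
    }
}
```

With discounting enabled, the stacked synonym does not lengthen the document, so the docLen value that later reaches score reflects only the original token positions.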
Method Detail
-
score
protected double score(BasicStats stats, double freq, double docLen)
Description copied from class: SimilarityBase
Scores the document doc. Subclasses must apply their scoring formula in this method.
- Specified by:
score in class SimilarityBase
- Parameters:
stats - the corpus level statistics.
freq - the term frequency.
docLen - the document length.
- Returns:
the score.
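The following standalone sketch shows one way a divergence-from-independence score can be computed from the three inputs above. It is an illustration under stated assumptions, not the Lucene source: it assumes the expected term frequency is (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1), that a term occurring no more often than expected scores zero, and that the final score is boost * log2(measure + 1). The three measures are modeled on the classes listed under See Also (standardized, saturated, chi-squared); the class DfiScoreSketch and all of its members are hypothetical names.

```java
public class DfiScoreSketch {

    /** Hypothetical stand-ins for the three independence measures. */
    enum Measure { STANDARDIZED, SATURATED, CHI_SQUARED }

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Term frequency expected if the term were spread independently of documents (assumed form). */
    static double expectedFrequency(long totalTermFreq, long numberOfFieldTokens, double docLen) {
        return (totalTermFreq + 1) * docLen / (numberOfFieldTokens + 1);
    }

    /** Divergence of the observed frequency from the expected one. */
    static double divergence(Measure measure, double freq, double expected) {
        switch (measure) {
            case STANDARDIZED:
                return (freq - expected) / Math.sqrt(expected);
            case SATURATED:
                return (freq - expected) / expected;
            default: // CHI_SQUARED
                return (freq - expected) * (freq - expected) / expected;
        }
    }

    static double score(Measure measure, long totalTermFreq, long numberOfFieldTokens,
                        double freq, double docLen, double boost) {
        double expected = expectedFrequency(totalTermFreq, numberOfFieldTokens, docLen);
        // A term that is not over-represented in the document contributes nothing.
        if (freq <= expected) {
            return 0.0;
        }
        return boost * log2(divergence(measure, freq, expected) + 1);
    }

    public static void main(String[] args) {
        // Toy statistics: term occurs 99 times among 9999 field tokens; document has 100 tokens.
        for (Measure m : Measure.values()) {
            System.out.println(m + ": " + score(m, 99, 9999, 4.0, 100.0, 1.0));
        }
    }
}
```

Because the expected frequency is derived from collection-wide counts, very common terms tend to fall at or below it and receive a score of zero, which is consistent with the recommendation above not to remove stopwords.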
-
getIndependence
public Independence getIndependence()
Returns the measure of independence used by this similarity.
-
explain
protected Explanation explain(BasicStats stats, Explanation freq, double docLen)
Description copied from class: SimilarityBase
Explains the score. The implementation here provides a basic explanation in the format score(name-of-similarity, doc=doc-id, freq=term-frequency), computed from:, and attaches the score (computed via the SimilarityBase.score(BasicStats, double, double) method) and the explanation for the term frequency. Subclasses content with this format may add additional details in SimilarityBase.explain(List, BasicStats, double, double).
- Overrides:
explain in class SimilarityBase
- Parameters:
stats - the corpus level statistics.
freq - the term frequency and its explanation.
docLen - the document length.
- Returns:
the explanation.
-
toString
public String toString()
Description copied from class: SimilarityBase
Subclasses must override this method to return the name of the Similarity and preferably the values of parameters (if any) as well.
- Specified by:
toString in class SimilarityBase