Package org.apache.nutch.scoring.link
Class LinkAnalysisScoringFilter
- java.lang.Object
  - org.apache.nutch.scoring.AbstractScoringFilter
    - org.apache.nutch.scoring.link.LinkAnalysisScoringFilter
- All Implemented Interfaces: Configurable, Pluggable, ScoringFilter
public class LinkAnalysisScoringFilter extends AbstractScoringFilter
-
Field Summary
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
X_POINT_ID
-
Constructor Summary
LinkAnalysisScoringFilter()
-
Method Summary
- float generatorSortValue(Text url, CrawlDatum datum, float initSort)
  This method prepares a sort value for the purpose of sorting and selecting the top N scoring pages during fetchlist generation.
- float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
  This method calculates an indexed document score/boost.
- void initialScore(Text url, CrawlDatum datum)
  Set an initial score for newly discovered pages.
- void passScoreAfterParsing(Text url, Content content, Parse parse)
  Currently a part of score distribution is performed using only data coming from the parsing process.
- void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
  This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata.
- void setConf(Configuration conf)
-
Methods inherited from class org.apache.nutch.scoring.AbstractScoringFilter
distributeScoreToOutlinks, getConf, injectedScore, updateDbScore
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.nutch.scoring.ScoringFilter
orphanedScore
Method Detail
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
- Overrides: setConf in class AbstractScoringFilter
-
generatorSortValue
public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method prepares a sort value for the purpose of sorting and selecting the top N scoring pages during fetchlist generation.
- Specified by: generatorSortValue in interface ScoringFilter
- Overrides: generatorSortValue in class AbstractScoringFilter
- Parameters:
  url - URL of the page
  datum - page's datum, should not be modified
  initSort - initial sort value, or a value from previous filters in the chain
- Returns: a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
- Throws: ScoringFilterException - if there is a fatal error preparing the sort value
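The initSort parameter carries either the initial sort value or the result of the previous filter in the chain. The following is a minimal sketch of that hand-off, not the Nutch implementation: SortFilter, SortChainSketch, and the two example filters are hypothetical stand-ins, and a plain float replaces the CrawlDatum score.

```java
// Hypothetical stand-in for one filter's sort step; NOT the Nutch ScoringFilter interface.
interface SortFilter {
    float sortValue(String url, float datumScore, float initSort);
}

final class SortChainSketch {

    // Feeds each filter's result into the next filter as its initSort argument.
    static float applyChain(SortFilter[] chain, String url, float datumScore, float initSort) {
        float sort = initSort;
        for (SortFilter filter : chain) {
            sort = filter.sortValue(url, datumScore, sort);
        }
        return sort;
    }

    // Two assumed example filters, chosen only to make the chaining visible.
    static SortFilter[] exampleChain() {
        SortFilter scaleByScore = (url, score, init) -> score * init;
        SortFilter doubleIt = (url, score, init) -> init * 2.0f;
        return new SortFilter[] { scaleByScore, doubleIt };
    }

    public static void main(String[] args) {
        // 0.5 * 1.0 = 0.5 after the first filter, then 0.5 * 2.0 = 1.0 after the second.
        float sort = applyChain(exampleChain(), "http://example.com/", 0.5f, 1.0f);
        System.out.println(sort); // prints 1.0
    }
}
```

With an empty chain the incoming initSort is returned unchanged, which matches the idea that each filter only refines the value it receives.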
-
indexerScore
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates an indexed document score/boost.
- Specified by: indexerScore in interface ScoringFilter
- Overrides: indexerScore in class AbstractScoringFilter
- Parameters:
  url - URL of the page
  doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance in order to store or remove some information.
  dbDatum - current page from CrawlDb. NOTE:
    - changes made to this instance are not persisted
    - may be null if indexing is done without CrawlDb, or if the segment is not generated from the CrawlDb (via FreeGenerator)
  fetchDatum - datum from FetcherOutput (containing, among others, the fetching status)
  parse - parsing result. NOTE: changes made to this instance are not persisted.
  inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
  initScore - initial boost value for the indexed document
- Returns: boost value for the indexed document. This value is passed as an argument to the next scoring filter in the chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
- Throws: ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
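Because dbDatum may be null, an implementation needs a fallback when no CrawlDb entry is available. The sketch below shows one way to handle that case and is only an illustration: IndexerBoostSketch is a hypothetical class, a nullable Float stands in for the dbDatum score, and multiplying the stored score by initScore is an assumed strategy that this page does not confirm.

```java
final class IndexerBoostSketch {

    // dbScore is null when no CrawlDb entry is available (stand-in for a null dbDatum).
    static float boost(Float dbScore, float initScore) {
        if (dbScore == null) {
            // Nothing to combine with, so pass the incoming boost to the next filter unchanged.
            return initScore;
        }
        // Assumed strategy: scale the incoming boost by the stored score.
        return dbScore * initScore;
    }

    public static void main(String[] args) {
        System.out.println(boost(null, 1.5f)); // prints 1.5
        System.out.println(boost(2.0f, 1.5f)); // prints 3.0
    }
}
```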
-
initialScore
public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly discovered pages. Note: newly discovered pages have at least one inlink with its score contribution, so filter implementations may choose to set the initial score to zero (unknown value), and then the inlink score contribution will set the "real" value of the new page.
- Specified by: initialScore in interface ScoringFilter
- Overrides: initialScore in class AbstractScoringFilter
- Parameters:
  url - URL of the page
  datum - new datum. Filters will modify it in-place.
- Throws: ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
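The note about starting at zero can be illustrated with a small sketch. It assumes a hypothetical PageRecord type in place of CrawlDatum and a simple sum of inlink contributions; neither is taken from the Nutch source.

```java
final class InitialScoreSketch {

    // Hypothetical stand-in for a page record; NOT the Nutch CrawlDatum class.
    static final class PageRecord {
        float score = 1.0f; // arbitrary value before any filter has run
    }

    // Marks the score as unknown by resetting it to zero, modifying the record in-place.
    static void initialScore(PageRecord page) {
        page.score = 0.0f;
    }

    // Assumed update step: inlink contributions are added on top of the zero start value.
    static void addInlinkContributions(PageRecord page, float[] contributions) {
        for (float contribution : contributions) {
            page.score += contribution;
        }
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord();
        initialScore(page);
        addInlinkContributions(page, new float[] { 0.25f, 0.5f });
        System.out.println(page.score); // prints 0.75
    }
}
```

Starting from zero means the final value depends only on what the inlinks contribute, which is the behaviour the description calls the "real" value.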
-
passScoreAfterParsing
public void passScoreAfterParsing(Text url, Content content, Parse parse) throws ScoringFilterException
Description copied from interface: ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process. We need this method in order to ensure the presence of score data in these steps.
- Specified by: passScoreAfterParsing in interface ScoringFilter
- Overrides: passScoreAfterParsing in class AbstractScoringFilter
- Parameters:
  url - page URL
  content - original content. NOTE: modifications to this value are not persisted.
  parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws: ScoringFilterException - if there is a fatal error processing score data in subsequent steps after parsing
-
passScoreBeforeParsing
public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass these values to the mechanism that distributes them to outlinked pages.
- Specified by: passScoreBeforeParsing in interface ScoringFilter
- Overrides: passScoreBeforeParsing in class AbstractScoringFilter
- Parameters:
  url - URL of the page
  datum - source datum. NOTE: modifications to this value are not persisted.
  content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
- Throws: ScoringFilterException - if there is a fatal error injecting score information from the current datum into Content metadata
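The hand-off described for passScoreBeforeParsing and passScoreAfterParsing (datum to Content metadata, then Content metadata to the parse result) can be sketched as follows. This is an illustration under stated assumptions, not the Nutch code: ScoreHandoffSketch, its method names, and the key "example.score" are hypothetical, and plain string maps stand in for the Content and Parse metadata objects.

```java
final class ScoreHandoffSketch {

    // Hypothetical metadata key; the real key name is not given on this page.
    static final String SCORE_KEY = "example.score";

    // Before parsing: copy the datum's score into the content metadata.
    static void beforeParsing(float datumScore, java.util.Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the score from the content metadata to the parse metadata, if present.
    static void afterParsing(java.util.Map<String, String> contentMeta,
                             java.util.Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    // Later step: read the score back, using a fallback when it was never set.
    static float readScore(java.util.Map<String, String> parseMeta, float fallback) {
        String raw = parseMeta.get(SCORE_KEY);
        return raw == null ? fallback : Float.parseFloat(raw);
    }

    public static void main(String[] args) {
        java.util.Map<String, String> contentMeta = new java.util.HashMap<>();
        java.util.Map<String, String> parseMeta = new java.util.HashMap<>();
        beforeParsing(0.75f, contentMeta);
        afterParsing(contentMeta, parseMeta);
        System.out.println(readScore(parseMeta, 0.0f)); // prints 0.75
    }
}
```

Keeping the score in metadata lets the step that distributes scores to outlinks work from parse output alone, which is the reason the descriptions give for both methods.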