Package org.apache.nutch.segment
Interface SegmentMergeFilter
public interface SegmentMergeFilter
Interface used to filter segments during a segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular, it allows filtering based on metadata collected while parsing a page.
Field Summary
Modifier and Type    Field         Description
static String        X_POINT_ID    The name of the extension point.
Method Summary
Modifier and Type    Method    Description
boolean    filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)    The filtering method which gets all information being merged for a given key (URL).
Field Detail
X_POINT_ID
static final String X_POINT_ID
The name of the extension point.
Method Detail
filter
boolean filter(Text key, CrawlDatum generateData, CrawlDatum fetchData, CrawlDatum sigData, Content content, ParseData parseData, ParseText parseText, Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given key (URL).
Parameters:
key - the segment record key
generateData - directory and data produced by the generation phase
fetchData - directory and data produced by the fetch phase
sigData - directory and data produced by the parse phase
content - directory and data produced by the parse phase
parseData - directory and data produced by the parse phase
parseText - directory and data produced by the parse phase
linked - all LINKED values from the latest segment
Returns:
true if values for this key (URL) should be merged into the new segment.
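The contract above (return true to merge a key, false to drop it) can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: MergedEntry, MergeFilter, rejectWhenMetaEquals, requireParseText and acceptAll are hypothetical stand-ins so that the example compiles on its own. A real plugin would implement org.apache.nutch.segment.SegmentMergeFilter and receive the eight arguments listed above. The null checks and the "every filter must accept" chaining rule are assumptions made for illustration, not behavior documented on this page.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these types are part of Nutch.
class SegmentMergeFilterSketch {

    /** Simplified view of the data being merged for one key (URL). */
    static final class MergedEntry {
        final String url;
        final Map<String, String> parseMeta; // stand-in for parse metadata; may be null
        final String parseText;              // stand-in for parsed text; may be null

        MergedEntry(String url, Map<String, String> parseMeta, String parseText) {
            this.url = url;
            this.parseMeta = parseMeta;
            this.parseText = parseText;
        }
    }

    /** Stand-in for the filter contract: true means "merge this key". */
    interface MergeFilter {
        boolean filter(MergedEntry entry);
    }

    /** Drops entries whose parse metadata maps name to value; accepts when metadata is absent. */
    static MergeFilter rejectWhenMetaEquals(String name, String value) {
        return entry -> entry.parseMeta == null || !value.equals(entry.parseMeta.get(name));
    }

    /** Drops entries that have no non-blank parsed text. */
    static MergeFilter requireParseText() {
        return entry -> entry.parseText != null && !entry.parseText.trim().isEmpty();
    }

    /** Assumed chaining rule: an entry is merged only if every filter accepts it. */
    static boolean acceptAll(List<MergeFilter> filters, MergedEntry entry) {
        for (MergeFilter f : filters) {
            if (!f.filter(entry)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<MergeFilter> filters =
            Arrays.asList(rejectWhenMetaEquals("robots", "noindex"), requireParseText());
        MergedEntry visible = new MergedEntry(
            "http://example.com/a", Collections.singletonMap("robots", "index"), "Hello");
        MergedEntry hidden = new MergedEntry(
            "http://example.com/b", Collections.singletonMap("robots", "noindex"), "Hidden");
        System.out.println(visible.url + " merged: " + acceptAll(filters, visible));
        System.out.println(hidden.url + " merged: " + acceptAll(filters, hidden));
    }
}
```

The metadata check is the kind of criterion the interface description mentions: it cannot be expressed as a URL pattern, because it depends on what was collected while parsing the page.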