Package org.apache.nutch.indexer.geoip
Class GeoIPIndexingFilter
- java.lang.Object
  - org.apache.nutch.indexer.geoip.GeoIPIndexingFilter
- All Implemented Interfaces: Configurable, IndexingFilter, Pluggable
public class GeoIPIndexingFilter extends Object implements IndexingFilter
This plugin implements an indexing filter that uses the GeoIP2-java API. This third-party library provides an API for the GeoIP2 Precision web services and databases, and it also works with the free GeoLite2 databases.
Depending on the service level agreement you have with the GeoIP service provider, the plugin can add some of the following fields to the index data model:
- Continent
- Country
- Regional Subdivision
- City
- Postal Code
- Latitude/Longitude
- ISP/Organization
- AS Number
- Confidence Factors
- Radius
- User Type
Some of these services are documented on the GeoIP2 Precision Services webpage, where more information is available.
You should also consult the following three properties in nutch-site.xml:
<!-- index-geoip plugin properties -->
<property>
  <name>index.geoip.usage</name>
  <value>insightsService</value>
  <description>
    A string representing the information source to be used for GeoIP information
    association. Either enter 'cityDatabase', 'connectionTypeDatabase',
    'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any one of the
    Database options, you should make one of GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb,
    GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the Hadoop classpath
    and available at runtime. This can be achieved by adding it to `$NUTCH_HOME/conf`.
    Alternatively, also the GeoLite2 IP databases (GeoLite2-*.mmdb) can be used.
  </description>
</property>
<property>
  <name>index.geoip.userid</name>
  <value></value>
  <description>
    The userId associated with the GeoIP2 Precision Services account.
  </description>
</property>
<property>
  <name>index.geoip.licensekey</name>
  <value></value>
  <description>
    The license key associated with the GeoIP2 Precision Services account.
  </description>
</property>
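The allowed values of index.geoip.usage and the database file each one expects can be summarized in a small lookup. The following is a minimal sketch using only the JDK; the class GeoIpUsageSketch and its methods are hypothetical and are not part of Nutch or the GeoIP2-java API. It also assumes that only 'insightsService' needs the userid and licensekey properties, which the descriptions above suggest but do not state outright.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not part of Nutch): models the documented values
// of the index.geoip.usage property.
final class GeoIpUsageSketch {

    // Database file names as listed in the index.geoip.usage description.
    private static final Map<String, String> DATABASE_FILES = Map.of(
        "cityDatabase", "GeoIP2-City.mmdb",
        "connectionTypeDatabase", "GeoIP2-Connection-Type.mmdb",
        "domainDatabase", "GeoIP2-Domain.mmdb",
        "ispDatabase", "GeoIP2-ISP.mmdb");

    static final String INSIGHTS_SERVICE = "insightsService";

    // True if the value is one of the five documented options.
    static boolean isValidUsage(String usage) {
        return usage != null
            && (INSIGHTS_SERVICE.equals(usage) || DATABASE_FILES.containsKey(usage));
    }

    // The .mmdb file expected on the classpath, if the option is a database.
    static Optional<String> databaseFileFor(String usage) {
        return usage == null
            ? Optional.empty()
            : Optional.ofNullable(DATABASE_FILES.get(usage));
    }

    // Assumption: only the web service option requires account credentials.
    static boolean needsCredentials(String usage) {
        return INSIGHTS_SERVICE.equals(usage);
    }

    public static void main(String[] args) {
        for (String usage : new String[] {"cityDatabase", "insightsService", "unknown"}) {
            System.out.println(usage
                + " valid=" + isValidUsage(usage)
                + " file=" + databaseFileFor(usage).orElse("(none)")
                + " credentials=" + needsCredentials(usage));
        }
    }
}
```

A check like this could run before a crawl to catch a misspelled option early; how the plugin itself reacts to an unknown value is not described here.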
Field Summary
-
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
Constructor Summary
GeoIPIndexingFilter()
    Default constructor for this plugin
-
Method Summary
NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
    Adds fields or otherwise modifies the document that will be indexed for a parse.
Configuration getConf()
void setConf(Configuration conf)
Method Detail
-
getConf
public Configuration getConf()
- Specified by: getConf in interface Configurable
- See Also: Configurable.getConf()
-
setConf
public void setConf(Configuration conf)
- Specified by: setConf in interface Configurable
- See Also: Configurable.setConf(org.apache.hadoop.conf.Configuration)
-
filter
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.
- Specified by: filter in interface IndexingFilter
- Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
- Returns: modified (or a new) document instance, or null (meaning the document should be discarded)
- Throws: IndexingException - if an error occurs during filtering
- See Also: IndexingFilter.filter(org.apache.nutch.indexer.NutchDocument, org.apache.nutch.parse.Parse, org.apache.hadoop.io.Text, org.apache.nutch.crawl.CrawlDatum, org.apache.nutch.crawl.Inlinks)
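The return contract of filter (a modified or new document, or null to discard it) can be illustrated without the Nutch classes. The sketch below uses stand-in types only: Doc, DocFilter and runFilters are hypothetical simplifications, not NutchDocument, IndexingFilter or any Nutch API, and the "country" field name and its value are placeholders chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration only; these are NOT the Nutch classes.
final class FilterContractSketch {

    // Simplified stand-in for a document that collects multi-valued fields.
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    // Simplified stand-in for an indexing filter: return the document to
    // keep it, or null to signal that it should be discarded.
    interface DocFilter {
        Doc filter(Doc doc, String url);
    }

    // Applies filters in order and stops as soon as one returns null.
    static Doc runFilters(List<DocFilter> filters, Doc doc, String url) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null; // document is dropped from indexing
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical filter that enriches the document with one field.
        DocFilter addCountry = (d, url) -> {
            d.add("country", "placeholder-country");
            return d;
        };
        // Hypothetical filter that discards every document.
        DocFilter dropAll = (d, url) -> null;

        Doc kept = runFilters(List.of(addCountry), new Doc(), "http://example.org/");
        Doc dropped = runFilters(List.of(dropAll, addCountry), new Doc(), "http://example.org/");

        System.out.println("kept fields: " + (kept == null ? "none" : kept.fields));
        System.out.println("dropped is null: " + (dropped == null));
    }
}
```

In this sketch a null result short-circuits the remaining filters, so later filters never see a discarded document.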