# Nutch Study Notes: Index

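Indexing is comparatively simple: it uses the interfaces provided by Lucene to build the index. Once the web page database (crawlDb) and the link database (linkDb) have been updated, the crawl program leaves its loop and performs the last task of the crawl phase, indexing. The entry point for indexing in crawl is: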

    indexer.index(indexes, crawlDb, linkDb, Arrays.asList(HadoopFSUtil.getPaths(fstats)));

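This method lives in the org.apache.nutch.indexer package; its job is to configure and launch the MapReduce job for the indexing phase. It calls initMRJob to configure the job: the input paths (the crawl and parse directories saved in each crawl round, plus the link database directory), the class that holds the Mapper and Reducer (IndexerMapReduce), the output format (IndexerOutputFormat), and the output key/value types (<Text, NutchWritable>).
initMRJob also configures the fields that are written to the index, after which the MapReduce job is started.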

Mapper phase

    public void map(Text key, Writable value,
        OutputCollector<Text, NutchWritable> output, Reporter reporter) throws IOException {
      output.collect(key, new NutchWritable(value));
    }

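The map function is trivial: it emits its input unchanged, only wrapping each value in a NutchWritable. Because the URL is the key, records are distributed to reducers by URL, so the different kinds of data for the same URL, read from several input directories, all arrive at the same reducer.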
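To make the grouping effect concrete, here is a minimal plain-Java sketch of an identity map followed by a group-by-key step. It does not use Hadoop or Nutch; the class `ShuffleSketch`, the method `mapAndShuffle`, and the string tags standing in for the real value types are all hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for "identity map + shuffle"; no Hadoop or Nutch classes involved.
class ShuffleSketch {

    // Each input entry is (url, value). The "map" step emits it unchanged,
    // and grouping by key imitates what the framework does before reduce().
    static Map<String, List<String>> mapAndShuffle(List<Map.Entry<String, String>> inputs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : inputs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Illustrative tags only; the real values are Writable objects.
        List<Map.Entry<String, String>> inputs = List.of(
            Map.entry("http://a.example/", "fetch datum"),
            Map.entry("http://b.example/", "fetch datum"),
            Map.entry("http://a.example/", "parse text"),
            Map.entry("http://a.example/", "inlinks"));
        // Each printed line corresponds to the input of one reduce() call.
        mapAndShuffle(inputs).forEach((url, values) -> System.out.println(url + " -> " + values));
    }
}
```

Each key in the returned map corresponds to one reduce() invocation, which is why the reducer can expect to see every kind of record for a URL together.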

Reducer phase

    public void reduce(Text key, Iterator<NutchWritable> values,
        OutputCollector<Text, NutchDocument> output, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {   // determine the type; values holds several kinds of data for the same url
        final Writable value = values.next().get(); // unwrap
        if (value instanceof Inlinks) {
          inlinks = (Inlinks)value;
        } else if (value instanceof CrawlDatum) {
          ......
      }   // end of while
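The type dispatch in this loop can be tried out in isolation. The following is a rough sketch under stated assumptions: `Inlinks`, `CrawlDatum`, and `ParseText` here are empty hypothetical stand-ins rather than the Nutch classes, the `Collected` holder and `classify` method are invented for the example, and the branches elided above are not reproduced.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of sorting mixed values by runtime type, as the reduce loop does.
class ReduceDispatchSketch {

    // Hypothetical empty stand-ins; the real Nutch classes carry actual data.
    static class Inlinks { }
    static class CrawlDatum { }
    static class ParseText { }

    // Holder for whatever was found among the values of one key.
    static class Collected {
        Inlinks inlinks;
        CrawlDatum datum;
        ParseText text;
        int unknown; // count of values whose type was not recognized
    }

    static Collected classify(Iterator<Object> values) {
        Collected c = new Collected();
        while (values.hasNext()) {
            Object value = values.next();
            if (value instanceof Inlinks) {
                c.inlinks = (Inlinks) value;
            } else if (value instanceof CrawlDatum) {
                c.datum = (CrawlDatum) value;
            } else if (value instanceof ParseText) {
                c.text = (ParseText) value;
            } else {
                c.unknown++;
            }
        }
        return c;
    }

    public static void main(String[] args) {
        List<Object> values = List.of(new ParseText(), new Inlinks(), new CrawlDatum());
        Collected c = classify(values.iterator());
        System.out.println("inlinks found: " + (c.inlinks != null));
        System.out.println("datum found:   " + (c.datum != null));
        System.out.println("text found:    " + (c.text != null));
    }
}
```

The order in which values arrive does not matter in this sketch, since each value is routed purely by its runtime type.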

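The reduce step first checks the type of each collected value. Because values mixes several types, each one is stored in its own variable for later use. The stored values are then written into the index document: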

    // add segment, used to map from merged index back to segment files
    doc.add("segment", metadata.get(Nutch.SEGMENT_NAME_KEY));
    // add digest, used by dedup
    doc.add("digest", metadata.get(Nutch.SIGNATURE_KEY));
    ......
    doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);
    boost = this.scfilters.indexerScore(key, doc, dbDatum, fetchDatum, parse, inlinks, boost);
    doc.add("boost", Float.toString(boost));
    output.collect(key, doc);

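The indexing filters are called to add any other fields of interest to the document. By default BasicIndexingFilter is used; its filter method adds basic fields such as title to doc. The scoring filter's indexerScore method is then called to compute the indexing-phase score, which is stored as the boost value and written to the index along with the other fields.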
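The filter-then-score sequence can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: the `Doc` class, the `Filter` interface, the convention that a filter returning null drops the document, and the multiplication used as the scoring rule are hypothetical simplifications, not the Nutch API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of "run indexing filters, then compute and store a boost".
class IndexFilterSketch {

    // Hypothetical simplified document: field name -> single value.
    static class Doc {
        final Map<String, String> fields = new LinkedHashMap<>();
        void add(String name, String value) { fields.put(name, value); }
    }

    // Hypothetical filter contract: return the (possibly enriched) doc,
    // or null to signal that the document should not be indexed.
    interface Filter {
        Doc filter(Doc doc, String url, String title);
    }

    static Doc index(List<Filter> filters, String url, String title,
                     float initialBoost, float score) {
        Doc doc = new Doc();
        for (Filter f : filters) {
            doc = f.filter(doc, url, title);
            if (doc == null) {
                return null; // a filter rejected the document
            }
        }
        // Assumed scoring rule for the sketch: scale the initial boost by a score.
        float boost = initialBoost * score;
        doc.add("boost", Float.toString(boost));
        return doc;
    }

    public static void main(String[] args) {
        // A "basic" filter that adds url and title fields.
        Filter basic = (doc, url, title) -> {
            doc.add("url", url);
            doc.add("title", title);
            return doc;
        };
        Doc result = index(List.of(basic), "http://a.example/", "Home", 1.0f, 2.0f);
        System.out.println(result.fields);
    }
}
```

Chaining the filters this way lets each one see the fields added by the ones before it, and the boost is added last so that it reflects the final document.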

Finally, the <url, doc> key/value pair is collected.

This concludes the indexing phase.