MapReduce Application Development Basics: WordCount Source Code Analysis

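To harness the power of a Hadoop cluster, you need to develop MapReduce applications. Tools such as Hive and Pig offer a higher-level route that greatly simplifies working with MapReduce, but understanding how to develop MapReduce applications and how MapReduce basically works is still very important.

The analysis below uses the source code of WordCount, an example program that ships with Hadoop: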

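import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {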

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
   
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
     
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write( word, one );
      }
    }
  }
 
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

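A complete MapReduce application consists of two classes: a Mapper, which takes in and processes the input records, and a Reducer, which aggregates the intermediate results and writes the output. A MapReduce application only needs to define the Map and Reduce logic plus a small amount of extra code.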

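Mapper class: Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> specifies the types of the input and output key/value pairs. Here they are Object, Text, Text, IntWritable. The input key is not used, so it is simply typed as Object. The content of one line is held in a Text, which is similar to String. The output is a (word, count) key/value pair, typed as Text and IntWritable.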

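Note: Hadoop-defined types such as Text and IntWritable are used instead of Java's String and Integer because Hadoop has to serialize these values in its own format when storing them and moving them around the distributed system. To define your own type, implement the Writable interface.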
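To make the Writable contract concrete, here is a minimal plain-Java sketch. It does not use Hadoop at all: WritableLike is a hypothetical stand-in that mirrors the two methods of Hadoop's Writable interface (write and readFields), and WordCountPair is an invented example type. Treat it as an illustration of the idea, not as Hadoop's own code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableSketch {

  /** Hypothetical stand-in that mirrors the two methods of Hadoop's Writable. */
  interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical custom type: a word together with its count. */
  static class WordCountPair implements WritableLike {
    String word = "";
    int count;

    WordCountPair() {}

    WordCountPair(String word, int count) {
      this.word = word;
      this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      // Fields are written in a fixed order...
      out.writeUTF(word);
      out.writeInt(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      // ...and must be read back in exactly the same order.
      word = in.readUTF();
      count = in.readInt();
    }
  }

  /** Serializes a pair to bytes and reads it back into a fresh object. */
  static WordCountPair roundTrip(WordCountPair original) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      original.write(new DataOutputStream(bytes));
      WordCountPair copy = new WordCountPair();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    WordCountPair copy = roundTrip(new WordCountPair("hadoop", 3));
    System.out.println(copy.word + " " + copy.count); // prints "hadoop 3"
  }
}
```

A real custom type would implement org.apache.hadoop.io.Writable instead of the stand-in interface; the shape of the two methods stays the same.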

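context is the interface for interacting with the MapReduce framework. It carries the output, the configuration, status information, and so on.

The map method processes the input key/value pair and writes the result to context, which takes care of passing it on.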

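Reducer class: the input key of the reduce method corresponds to a key output by map, and the input value is the collection of values that map output for that key. The same key coming from multiple map tasks is gathered under a single key in reduce, with its values forming a collection/list. After processing, reduce writes out one key/value pair.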
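The data flow described above (map emits pairs, the framework groups them by key, reduce sums each group) can be imitated in plain Java. The sketch below uses no Hadoop classes; the class name and the shuffle method are invented for illustration, and in a real job the grouping is done by the framework rather than by user code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Plain-Java imitation of the WordCount data flow; no Hadoop classes are used. */
class WordCountFlowSketch {

  /** Map step: emit one (word, 1) pair per token of a line. */
  static List<Map.Entry<String, Integer>> map(String line) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      out.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
    }
    return out;
  }

  /** Shuffle step (the framework's job in Hadoop): group values by key. */
  static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs) {
      grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
    }
    return grouped;
  }

  /** Reduce step: sum the values collected for one key. */
  static int reduce(String key, Iterable<Integer> values) {
    int sum = 0;
    for (int v : values) {
      sum += v;
    }
    return sum;
  }

  /** Runs map over every line, groups the pairs, then reduces each group. */
  static Map<String, Integer> run(List<String> lines) {
    List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
    for (String line : lines) {
      pairs.addAll(map(line));
    }
    Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
      result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    // prints {hadoop=1, hello=2, world=1}
    System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
  }
}
```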

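Creating the Job:

GenericOptionsParser is a helper class for parsing Hadoop command-line arguments.
Job job = new Job(conf, "word count") creates the Job object, passing in the Hadoop configuration and the job name. (This constructor is marked deprecated in newer Hadoop versions, where Job.getInstance(conf, "word count") is the replacement.)
job.setJarByClass specifies the jar file by naming a class it contains. setJar() can be used instead.
job.setMapperClass sets the Mapper class.
job.setCombinerClass sets the Combiner. To reduce network traffic, a Combiner merges the output of a map task locally before it is sent on, acting as a mini reducer. With the logic used here, it can simply reuse the Reducer class.
job.setReducerClass sets the Reducer class.
job.setOutputKeyClass sets the type of the output key.
job.setOutputValueClass sets the type of the output value.
FileInputFormat.addInputPath specifies the path of the input files.
FileOutputFormat.setOutputPath specifies the path of the output directory.
job.waitForCompletion(true) is one way to submit the job: it submits the job and waits for it to finish. The parameter indicates whether to print information about the job's progress. The return value is a boolean that indicates whether the job succeeded.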
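The effect of the Combiner mentioned in the list above can be shown with a small plain-Java sketch. It assumes two map tasks with made-up inputs, and it reuses one summing function as both "combiner" and "reducer", which is valid here because addition is associative and commutative. All names are invented for illustration; no Hadoop API is involved.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class CombinerSketch {

  /** Output of one map task: a (word, 1) pair per token. */
  static List<Map.Entry<String, Integer>> mapOutput(String input) {
    List<Map.Entry<String, Integer>> out = new ArrayList<>();
    StringTokenizer itr = new StringTokenizer(input);
    while (itr.hasMoreTokens()) {
      out.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
    }
    return out;
  }

  /** Sums counts per word; serves as both the "combiner" and the "reducer". */
  static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
    Map<String, Integer> sums = new TreeMap<>();
    for (Map.Entry<String, Integer> p : pairs) {
      sums.merge(p.getKey(), p.getValue(), Integer::sum);
    }
    return sums;
  }

  /** Records that would cross the network, with or without local combining. */
  static List<Map.Entry<String, Integer>> shuffled(String[] mapInputs, boolean useCombiner) {
    List<Map.Entry<String, Integer>> sent = new ArrayList<>();
    for (String input : mapInputs) {
      List<Map.Entry<String, Integer>> out = mapOutput(input);
      if (useCombiner) {
        sent.addAll(sumByKey(out).entrySet()); // pre-aggregate per map task
      } else {
        sent.addAll(out);
      }
    }
    return sent;
  }

  public static void main(String[] args) {
    String[] mapInputs = {"a b a a", "b a b"};
    List<Map.Entry<String, Integer>> plain = shuffled(mapInputs, false);
    List<Map.Entry<String, Integer>> combined = shuffled(mapInputs, true);
    System.out.println(plain.size() + " records -> " + sumByKey(plain));       // 7 records -> {a=4, b=3}
    System.out.println(combined.size() + " records -> " + sumByKey(combined)); // 4 records -> {a=4, b=3}
  }
}
```

The final counts are identical in both cases; only the number of intermediate records changes.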

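Some further questions:

1. No InputFormat is set for the job here, so how can we confirm from the source code that TextInputFormat is the default InputFormat?

A job needs an InputFormat before it is submitted, as in this setting from another example program, MultiFileWordCount: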

//set the InputFormat of the job to our InputFormat
job.setInputFormatClass(MyInputFormat.class);

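Here, however, we did not set an InputFormat. It is said to be TextInputFormat, but how can we confirm that?

Looking at the source of the Job class (org.apache.hadoop.mapreduce.Job), we find only the setter setInputFormatClass and no getter, yet whoever consumes the value must obtain it through a getter. Looking next at Job's parent class, org.apache.hadoop.mapreduce.task.JobContextImpl, we find the following definition: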

  /**
   * Get the {@link InputFormat} class for the job.
   *
   * @return the {@link InputFormat} class for the job.
   */
  @SuppressWarnings("unchecked")
  public Class<? extends InputFormat<?,?>> getInputFormatClass()
     throws ClassNotFoundException {
    return (Class<? extends InputFormat<?,?>>)
      conf.getClass(INPUT_FORMAT_CLASS_ATTR, TextInputFormat.class);
  }

As you can see, if the conf object contains no such setting, TextInputFormat is used by default. This basically confirms it.
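The "look up a class name, fall back to a default" pattern behind conf.getClass can be sketched in a few lines of plain Java. This is only an imitation of the idea: ConfDefaultSketch, getClassOrDefault, and the property name are all hypothetical, and the JDK list classes stand in for InputFormat implementations.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mini-configuration illustrating a class lookup with a default. */
class ConfDefaultSketch {
  private final Map<String, String> props = new HashMap<>();

  void set(String name, String className) {
    props.put(name, className);
  }

  /** Returns the configured class, or defaultValue when the property is unset. */
  Class<?> getClassOrDefault(String name, Class<?> defaultValue) {
    String className = props.get(name);
    if (className == null) {
      return defaultValue;
    }
    try {
      return Class.forName(className);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    ConfDefaultSketch conf = new ConfDefaultSketch();
    String key = "example.list.class"; // hypothetical property name

    // Nothing set yet, so the default wins: prints java.util.ArrayList
    System.out.println(conf.getClassOrDefault(key, java.util.ArrayList.class).getName());

    // After an explicit setting, the default is ignored: prints java.util.LinkedList
    conf.set(key, "java.util.LinkedList");
    System.out.println(conf.getClassOrDefault(key, java.util.ArrayList.class).getName());
  }
}
```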

Going one step further, conf originates in WordCount:

Configuration conf = new Configuration();

After being passed to Job, it is converted:

  @Deprecated
  public Job(Configuration conf, String jobName) throws IOException {
    this(conf);
    setJobName(jobName);
  }

  Job(JobConf conf) throws IOException {
    super(conf, null);
    // propagate existing user credentials to job
    this.credentials.mergeAll(this.ugi.getCredentials());
    this.cluster = null;
  }

The original conf is converted to a JobConf and passed to the parent class, the JobContextImpl mentioned above. Looking at JobConf, only the following two methods involve InputFormat:

  /**
   * Get the {@link InputFormat} implementation for the map-reduce job,
   * defaults to {@link TextInputFormat} if not specified explicity.
   *
   * @return the {@link InputFormat} implementation for the map-reduce job.
   */
  public InputFormat getInputFormat() {
    return ReflectionUtils.newInstance(getClass("mapred.input.format.class",
                                                TextInputFormat.class,
                                                InputFormat.class),
                                       this);
  }

  /**
   * Set the {@link InputFormat} implementation for the map-reduce job.
   *
   * @param theClass the {@link InputFormat} implementation for the map-reduce
   *                 job.
   */
  public void setInputFormat(Class<? extends InputFormat> theClass) {
    setClass("mapred.input.format.class", theClass, InputFormat.class);
  }

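OK, this confirms once more that the default InputFormat is TextInputFormat. Done! (Strictly speaking, these two JobConf methods belong to the older org.apache.hadoop.mapred API, so the getInputFormatClass method in JobContextImpl is the one that applies to WordCount; both paths default to a TextInputFormat.)

The same steps can obviously be used to answer any similar question.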

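2. WordCount relies on the MapReduce principle: the input file is divided into splits, each split is counted separately (Map), and the words and counts from the different Maps are then accumulated. This raises a concern: if a word is cut across two different splits, or even two different DataNodes, will it be counted as two words and make the final result wrong?

Let's use an example to check whether this error can occur. Create a single string larger than 64 MB, upload it to HDFS, and process it with WordCount. The result reports 1 word, which is correct!

So how is this achieved?

Let's review how input is processed in a MapReduce job: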

Specify the input file path, e.g. FileInputFormat.addInputPaths(job, args[0])
Specify how the files are to be handled, e.g. job.setInputFormatClass(MyInputFormat.class)
Inside this InputFormat class, consider:
1. Whether the file can be split (isSplitable(JobContext context, Path file))
2. How to split it (List<InputSplit> getSplits(JobContext job))
3. How to read the records in a split (RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context))
Once these are settled, the user's Map task can process each record's key/value directly.

The question here is about how records are read. We already know that TextInputFormat is in use, so let's look at its createRecordReader:

  public RecordReader<LongWritable, Text>
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    String delimiter = context.getConfiguration().get(
        "textinputformat.record.delimiter");
    byte[] recordDelimiterBytes = null;
    if (null != delimiter)
      recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
    return new LineRecordReader(recordDelimiterBytes);
  }

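In the end, org.apache.hadoop.mapreduce.lib.input.LineRecordReader is what reads the line records.

Although LineRecordReader handles one split at a time, the content it reads may extend into the next split. It guarantees that a record (as determined by the record delimiter you specify, a newline by default) is always read by the handler of the split in which the record begins. In fact:

1) Splits are purely logical. A split only specifies a start and an end position within the file; the file is not physically cut into smaller files.

2) When processing a split that is not the first one, the first line is always discarded. Every split reads one extra line when it finishes, so the first line of this split has already been handled together with the previous split:

In LineRecordReader's initialize method: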

    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }

3) The logic that makes the first line of a split get processed together with the previous split:

In LineRecordReader's nextKeyValue method:

    while (getFilePosition() <= end) {
      newSize = in.readLine(value, maxLineLength,
          Math.max(maxBytesToConsume(pos), maxLineLength));
      ...
    }

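The <= means that by the time this split is finished, the file position that has been read is necessarily beyond end. Meanwhile, in.readLine simply reads a whole line forward through the file (not through the split). Together these guarantee that the first line of the next split is processed within this split.

This explains why WordCount (and every MapReduce program that reads records with LineRecordReader) does not mishandle content that spans lines or splits.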
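The two rules above (skip the first line unless the split starts at offset 0, and keep reading while the position is <= end) can be tried out with a small plain-Java simulation. This is a simplified sketch, not Hadoop's LineRecordReader: it works on an in-memory byte array, assumes a "\n" delimiter and ASCII text, and every name in it is invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SplitBoundarySketch {

  /** Returns the offset just past the line that contains pos (newline included). */
  static int endOfLine(byte[] data, int pos) {
    int i = pos;
    while (i < data.length && data[i] != '\n') {
      i++;
    }
    return Math.min(i + 1, data.length);
  }

  /** Reads the records that belong to the logical split [start, end). */
  static List<String> readSplit(byte[] data, int start, int end) {
    List<String> records = new ArrayList<>();
    int pos = start;
    if (start != 0) {
      // Not the first split: throw away the first (possibly partial) line,
      // because the previous split's reader has already consumed it.
      pos = endOfLine(data, pos);
    }
    // "<=" lets this reader take one line that starts exactly at, or runs past, end.
    while (pos <= end && pos < data.length) {
      int next = endOfLine(data, pos);
      String line = new String(data, pos, next - pos, StandardCharsets.UTF_8);
      records.add(line.endsWith("\n") ? line.substring(0, line.length() - 1) : line);
      pos = next;
    }
    return records;
  }

  /** Cuts the text into fixed-size splits and concatenates what each reader returns. */
  static List<String> readAllSplits(String text, int splitSize) {
    byte[] data = text.getBytes(StandardCharsets.UTF_8);
    List<String> all = new ArrayList<>();
    for (int start = 0; start < data.length; start += splitSize) {
      int end = Math.min(start + splitSize, data.length);
      all.addAll(readSplit(data, start, end));
    }
    return all;
  }

  public static void main(String[] args) {
    // Split size 5 cuts every line in the middle, yet each line appears exactly once.
    // prints [hello world, hadoop mapreduce, foo]
    System.out.println(readAllSplits("hello world\nhadoop mapreduce\nfoo\n", 5));
    // A single long "word" spanning several splits is still one record.
    // prints [abcdefghij]
    System.out.println(readAllSplits("abcdefghij", 3));
  }
}
```

With a split size of 12, the second line starts exactly at the first split's end offset; the first reader still picks it up because of the <= comparison, and the second reader skips it.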

References:
http://my.oschina.net/xiangchen/blog/99653
http://www.informit.com/articles/article.aspx?p=2017060
http://blog.csdn.net/derekjiang/article/details/6851625

The post MapReduce Application Development Basics: WordCount Source Code Analysis appeared first on SQLParty.
