Organizational Research By

Surprising Research Topic questions - Question: How to write a Map-Reduce program?


Hadoop comes with a set of demonstration programs. They are located in ~/hadoop/src/examples/org/apache/hadoop/examples/. One of them is WordCount, which automatically computes the word frequency of all text files found in the HDFS directory you ask it to process.
The program has several sections:
The map section
  public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value, 
                    OutputCollector<Text, IntWritable> output, 
                    Reporter reporter) throws IOException {
      // split the incoming line into whitespace-separated words
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        // emit (word, 1) for every token found
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }
The Map class takes the lines of text that are fed to it (Hadoop automatically breaks the text files into lines, so there is no need for us to do it) and breaks them into words. It outputs a datagram for each word, a (String, int) tuple of the form ("some-word", 1). Each tuple corresponds to a single occurrence of a word, so the initial count for each word is 1.
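To make the shape of the map output concrete, here is a minimal plain-Java sketch that needs no Hadoop installation. The class name MapSketch and the method mapLine are hypothetical names chosen for this illustration, and a list of java.util map entries stands in for Hadoop's OutputCollector; it is not how Hadoop itself stores the tuples.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; not part of the Hadoop API.
class MapSketch {
    // Emits one (word, 1) pair per token, much as MapClass.map() does
    // through output.collect(word, one).
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // "the" appears twice and therefore yields two separate pairs.
        System.out.println(mapLine("the cat saw the dog"));
        // prints [the=1, cat=1, saw=1, the=1, dog=1]
    }
}
```

Note that nothing is added up at this stage: a repeated word simply produces several tuples with the value 1.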
The reduce section
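  public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, 
                       Reporter reporter) throws IOException {
      // add up every count received for this word
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      // emit (word, total)
      output.collect(key, new IntWritable(sum));
    }
  }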
The reduce section gets collections of datagrams of the form [( word, n1 ), (word, n2)...] where all the words are the same, but with different numbers. These collections are the result of a sorting process that is integral to Hadoop and which gathers all the datagrams with the same word together. The reduce process gathers the datagrams inside a datanode, and also gathers datagrams from the different datanodes into a final collection of datagrams where all the words are now unique, with their total frequency (number of occurrences).
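The grouping and summing described above can be imitated in plain Java. This is only a sketch under stated assumptions: ReduceSketch, pair and shuffleAndReduce are hypothetical names, and a sorted TreeMap stands in for Hadoop's sort-and-shuffle phase, which in reality is distributed across the cluster.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the shuffle and reduce steps; not Hadoop API.
class ReduceSketch {
    // Convenience helper that builds one (word, count) tuple.
    static Map.Entry<String, Integer> pair(String word, int count) {
        return new SimpleEntry<>(word, count);
    }

    // Mirrors the body of Reduce.reduce(): sums all counts for one word.
    static int reduce(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Groups the tuples by word in sorted key order (the "shuffle"),
    // then reduces each group to a single total.
    static Map<String, Integer> shuffleAndReduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue().iterator()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
            pair("the", 1), pair("cat", 1), pair("the", 1), pair("dog", 1));
        System.out.println(shuffleAndReduce(pairs)); // prints {cat=1, dog=1, the=2}
    }
}
```

Every word ends up exactly once in the result, paired with its total count.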
The map-reduce organization section
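The job configuration registers the same Reduce class for both the combining stage and the reduce stage (in the old mapred API this is done with the setCombinerClass and setReducerClass methods of JobConf). This makes sense, since the number of occurrences of a word as generated on several datanodes is just the sum of the numbers of occurrences found on each datanode.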
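Why one class can serve both stages can be checked with a small plain-Java experiment. The sketch below is an illustration only: CombinerSketch, mapWords, sumPerWord and countWithCombiner are hypothetical names, and the two strings stand in for the text held on two datanodes. The same summing function is first applied per node (the combine stage) and then applied again to the partial totals (the reduce stage).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical illustration of combiner == reducer; not Hadoop API.
class CombinerSketch {
    // Map step: one (word, 1) tuple per token.
    static List<Map.Entry<String, Integer>> mapWords(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Used for BOTH stages: adds up the counts attached to each word.
    static Map<String, Integer> sumPerWord(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    // Combine locally on each "node", then reduce the partial totals.
    static Map<String, Integer> countWithCombiner(String nodeA, String nodeB) {
        List<Map.Entry<String, Integer>> partials = new ArrayList<>();
        partials.addAll(sumPerWord(mapWords(nodeA)).entrySet());
        partials.addAll(sumPerWord(mapWords(nodeB)).entrySet());
        return sumPerWord(partials);
    }

    public static void main(String[] args) {
        // Two-stage count over text split across two nodes.
        System.out.println(countWithCombiner("to be or not", "to be"));
        // Single-stage count over the same text, for comparison.
        System.out.println(sumPerWord(mapWords("to be or not to be")));
        // both lines print {be=2, not=1, or=1, to=2}
    }
}
```

Because addition is associative and commutative, summing partial sums gives the same totals as summing the raw 1s, while sending fewer tuples between nodes.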
The datagram definitions
    // the keys are words (strings)
    // the values are counts (ints)
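As the comments indicate, the datagrams are of the form (String, int). In Hadoop terms, the output key class is Text and the output value class is IntWritable, matching the types used in the map and reduce sections.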
Running WordCount
Run the wordcount Java program from the examples jar that ships with Hadoop:
hadoop@hadoop102:~/352/dft$ hadoop jar /Users/hadoop/hadoop/hadoop-0.19.2-examples.jar wordcount dft dft-output
The program takes about 40 seconds to execute on the cluster. The output generated will look something like this:
10/12/16 11:03:52 INFO mapred.FileInputFormat: Total input paths to process : 1
10/12/16 11:03:52 INFO mapred.JobClient: Running job: job_201012161018_0003
10/12/16 11:03:53 INFO mapred.JobClient:  map 0% reduce 0%
10/12/16 11:03:57 INFO mapred.JobClient:  map 1% reduce 0%
10/12/16 11:04:02 INFO mapred.JobClient:  map 10% reduce 0%
10/12/16 11:04:07 INFO mapred.JobClient:  map 21% reduce 0%
10/12/16 11:04:11 INFO mapred.JobClient:  map 31% reduce 0%
10/12/16 11:04:15 INFO mapred.JobClient:  map 41% reduce 0%
10/12/16 11:04:19 INFO mapred.JobClient:  map 52% reduce 0%
10/12/16 11:04:24 INFO mapred.JobClient:  map 65% reduce 0%
10/12/16 11:04:29 INFO mapred.JobClient:  map 78% reduce 0%
10/12/16 11:04:33 INFO mapred.JobClient:  map 89% reduce 0%
10/12/16 11:04:38 INFO mapred.JobClient:  map 100% reduce 0%
10/12/16 11:04:39 INFO mapred.JobClient: Job complete: job_201012161018_0003
10/12/16 11:04:39 INFO mapred.JobClient: Counters: 8
10/12/16 11:04:39 INFO mapred.JobClient:   File Systems
10/12/16 11:04:39 INFO mapred.JobClient:     HDFS bytes read=3529994
10/12/16 11:04:39 INFO mapred.JobClient:     HDFS bytes written=887496
10/12/16 11:04:39 INFO mapred.JobClient:   Job Counters 
10/12/16 11:04:39 INFO mapred.JobClient:     Rack-local map tasks=685
10/12/16 11:04:39 INFO mapred.JobClient:     Launched map tasks=863
10/12/16 11:04:39 INFO mapred.JobClient:     Data-local map tasks=178
10/12/16 11:04:39 INFO mapred.JobClient:   Map-Reduce Framework
10/12/16 11:04:39 INFO mapred.JobClient:     Map input records=33055
10/12/16 11:04:39 INFO mapred.JobClient:     Map input bytes=1573044
10/12/16 11:04:39 INFO mapred.JobClient:     Map output records=267975

asked Sep 13, 2013 in Hadoop by anonymous
edited Sep 12, 2013