The following are the key steps of how Hadoop MapReduce solves a word-count problem:
- The input is fed to a reader, the RecordReader, which reads the data line by line or record by record and passes each record to the mapper.
- The mapping process starts: each record is split into words, and each word is emitted as a key-value pair such as (word, 1). The framework then applies the following steps:
  - Combining: optionally sums the counts for each word locally within a single mapper's output, reducing the data sent over the network
  - Partitioning: assigns each word (key) to one of the reducers, typically by hashing the key
  - Shuffling: moves each key-value pair to the reducer that owns its partition
  - Sorting: sorts each reducer's input by word, so all counts for the same word arrive together
- The last step is reducing, which sums the counts for each word and produces the final result: the total count for every distinct word.
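The steps above can be sketched in plain Python. This is a simulation of the phases for illustration, not actual Hadoop code; the function names, the two-reducer setup, and the sample input are assumptions made for the example:

```python
from collections import defaultdict

def map_phase(lines):
    # Mapping: emit a (word, 1) pair for every word in every input record
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def partition(key, num_reducers):
    # Partitioning: hash each key to one of the reducers
    return hash(key) % num_reducers

def shuffle_and_sort(pairs, num_reducers):
    # Shuffling: route each pair to its partition;
    # Sorting: order each partition's contents by key
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]

def reduce_phase(partitions):
    # Reducing: sum the counts for each word
    counts = {}
    for part in partitions:
        for word, values in part:
            counts[word] = sum(values)
    return counts

lines = ["the quick brown fox", "the lazy dog", "the fox"]
grouped = shuffle_and_sort(map_phase(lines), num_reducers=2)
print(reduce_phase(grouped))  # "the" maps to 3, "fox" to 2, the rest to 1
```

In real Hadoop the same roles are played by the Mapper and Reducer classes, with partitioning, shuffling, and sorting handled by the framework between them.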
The following diagram represents the above steps:
- Map: this phase processes the input data in the form of key-value pairs
- Partitioning/Shuffling/Sorting: these steps group identical keys together and sort them
- Reduce: this phase aggregates the values for each key and emits the final result
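The combining step deserves a closer look, since it is the one optional phase: it is a local reduce that runs on each mapper's output before shuffling, shrinking the data sent across the network. A minimal sketch, with a made-up single-mapper output as input:

```python
from collections import Counter

def combine(mapper_output):
    # Combining: sum counts per word within one mapper's output,
    # before anything is shuffled across the network
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())

# Hypothetical raw output of one mapper for the record "the fox and the dog"
raw = [("the", 1), ("fox", 1), ("and", 1), ("the", 1), ("dog", 1)]
combined = combine(raw)
print(combined)  # five pairs shrink to four: ("the", 2) replaces two ("the", 1) pairs
```

Because word counting is associative and commutative, the combiner can reuse the same summing logic as the reducer; Hadoop in fact lets you register the Reducer class itself as the combiner for this job.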