YARN example

YARN / MapReduce

Three "simple" examples of MapReduce jobs are available on the platform:

  1. Estimating the value of Pi
  2. Counting words
  3. Searching Wikipedia

1. Estimating the value of Pi

[xxxx@osirim-hadoop ~]$ hadoop jar /usr/hdp/ pi 10 1000
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
...
Wrote input for Map #8
Wrote input for Map #9
Starting Job
Job Finished in 37.872 seconds
Estimated value of Pi is 3.14080000000000000000
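The pi example estimates Pi by scattering points in the unit square and counting how many fall inside the inscribed quarter circle; each map task samples its share of points and the reduce step combines the counts. A minimal single-machine sketch of the same idea (plain Monte Carlo with pseudo-random points, not the Hadoop implementation, which uses a quasi-random sequence):

```python
import random

def estimate_pi(num_maps, samples_per_map, seed=0):
    """Estimate Pi: the fraction of random points in the unit square
    that land inside the quarter circle of radius 1 approaches Pi/4."""
    random.seed(seed)
    inside = 0
    total = num_maps * samples_per_map
    for _ in range(total):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / total

# Same parameters as the job above: 10 "maps" of 1000 samples each.
print(estimate_pi(10, 1000))
```

More samples per map tighten the estimate, which is why the job takes the two parameters separately: maps control parallelism, samples control accuracy.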


2. Counting words

Creating the input data directory "in"
[xxxx@osirim-hadoop ~]$ hadoop fs -mkdir in 
Copying the /etc/hosts file into the "in" directory created above
[xxxx@osirim-hadoop ~]$ hadoop fs -put -f /etc/hosts in 
Listing the contents of the "in" directory
[xxxx@osirim-hadoop ~]$ hadoop fs -ls in 
Found 1 items
-rw-r--r-- xxxxx systeme   1054   2016-12-12 15:13 in/hosts
Running the wordcount example from the hadoop-mapreduce-examples.jar program
[xxxx@osirim-hadoop ~]$ hadoop jar /usr/hdp/ wordcount in out
16/01/18 14:44:01 INFO impl.TimelineClientImpl: Timeline service address:
16/01/18 14:44:01 INFO client.RMProxy: Connecting to ResourceManager at co2-hdp-
16/01/18 14:44:03 INFO input.FileInputFormat: Total input paths to process : 1
File Input Format Counters
Bytes Read=1054
File Output Format Counters
Bytes Written=1122
Listing the contents of the output directory "out"
[xxxx@osirim-hadoop ~]$ hadoop fs -ls out
Found 2 items
-rw-r--r-- 1 xxxxx systeme 0 2016-12-12 15:17 out/_SUCCESS
-rw-r--r-- 1 xxxxx systeme 1122 2016-12-12 15:17 out/part-r-00000
Displaying the contents of the output file "part-r-00000"
[xxxx@osirim-hadoop ~]$ hadoop fs -cat out/part-r-00000
localhost.localdomain 2
localhost4 1
localhost4.localdomain4 1
localhost6 1
localhost6.localdomain6 1
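The output above is the classic map/reduce pattern at work: the map phase emits a (word, 1) pair for every word, the framework groups the pairs by word, and the reduce phase sums each group. A minimal local sketch of that logic (the sample lines are illustrative, not the actual /etc/hosts contents):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every whitespace-separated word.
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Reduce: sum the counts per distinct word. In Hadoop, the shuffle
    # step groups pairs by key; a dict plays that role here.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Hypothetical hosts-file lines, just to exercise the pipeline.
hosts = ["127.0.0.1 localhost localhost.localdomain",
         "::1 localhost localhost.localdomain"]
print(reduce_phase(map_phase(hosts)))
```

Each reducer writes its results to one part-r-NNNNN file, which is why the "out" directory above contains part-r-00000 alongside the empty _SUCCESS marker.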


3. Searching Wikipedia

One advantage of the storage array is its ability to serve data through several protocols: SMB, NFS, HDFS, etc.
Data can therefore be transferred to the array in the traditional way, into file directories, for example via FTP or HTTP.
MapReduce jobs, or other processing, can then access this data through HDFS.
In this example, the textual content of the Wikipedia site is downloaded.
The MapReduce grep example is then applied to find occurrences of a word or group of words.
Regular expressions can be used in the search.
Downloading the enwiki-latest-pages-articles.xml.bz2 file from http://dumps.wikipedia.org/enwiki/latest
[xxxx@osirim-hadoop ~]$ cd /projets/test
[xxxx@osirim-hadoop ~]$ wget http://dumps.wikipedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
Running the grep example on this file to extract lines containing the phrase "Big Data"
[xxxx@osirim-hadoop ~]$ hadoop jar /usr/hdp/ grep /projets/test /projets/test/wikigrep "Big Data"
Displaying the result in the file /projets/test/wikigrep/part-r-00000
[xxxx@osirim-hadoop ~]$ hadoop fs -cat /projets/test/wikigrep/part-r-00000
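The Hadoop grep example counts every match of the given regular expression across the input and sorts the results by decreasing frequency (it runs a count job followed by a sort job). A minimal local sketch of that behavior, on made-up sample lines:

```python
import re
from collections import Counter

def grep_counts(lines, pattern):
    # Count every occurrence of the regex in the input lines,
    # then order the results by decreasing frequency.
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        counts.update(regex.findall(line))
    return counts.most_common()

# Hypothetical input lines standing in for the Wikipedia dump.
sample = ["Big Data is a field of data science",
          "analytics on Big Data platforms",
          "small data"]
print(grep_counts(sample, r"Big Data"))
```

Because the pattern is a regular expression, a search such as "Big [Dd]ata" would also match lowercase variants; plain phrases like "Big Data" work unchanged since they contain no regex metacharacters.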