Hadoop Quick Notes :: Part - 2
Bhaskar S | 12/25/2013
Introduction
In Part-1, we laid out the steps to install and set up a 3-node Hadoop 1.x cluster.
Now, we will actually bring the Hadoop cluster into action.
Start-up
On hadoop-master, open a command terminal window and issue the following command to start the HDFS daemons (the NameNode and SecondaryNameNode on the master and a DataNode on each of the slaves):
start-dfs.sh
The following Figure-1 illustrates the execution of the above command in the terminal window:
Next, issue the following command to start the MapReduce daemons (the JobTracker on the master and a TaskTracker on each of the slaves):
start-mapred.sh
The following Figure-2 illustrates the execution of the above command in the terminal window:
To verify all the necessary processes are started, issue the following command:
jps
The output of the above command should look like the one from Figure-3 below:
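In particular, on hadoop-master one would typically expect to see entries like the following (the process ids will differ from run to run; this is only an illustrative sketch based on the daemons we chose to run on the master):
<pid> NameNode
<pid> SecondaryNameNode
<pid> JobTracker
<pid> Jps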
On hadoop-slave-1 (or hadoop-slave-2), open a command terminal window and issue the following command:
jps
The output of the above command should look like the one from Figure-4 below:
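On a slave node, one would typically expect entries like the following (again, only an illustrative sketch; the process ids will differ):
<pid> DataNode
<pid> TaskTracker
<pid> Jps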
The following diagram in Figure-5 illustrates our 3-node Hadoop cluster:
Concepts
In this section we will elaborate on the core modules and components of Hadoop.
The core of Hadoop is made up of the following two modules:
Hadoop Distributed File System (HDFS)
Is a distributed, user-level file system for storing very large data sets, typically in the terabytes, petabytes, or exabytes
Data sets are broken into large chunks (data blocks of a fixed size of 64MB or more; the default is 64MB) and spread across multiple nodes in the cluster
The data block size can be configured by specifying a value in bytes for the property dfs.block.size in the file conf/hdfs-site.xml
Each chunk (data block) is also replicated to other nodes of the cluster (the default replication factor is 3), providing data redundancy and fault tolerance
The replication factor can be configured by specifying a value for the property dfs.replication in the file conf/hdfs-site.xml (a per-request override example appears right after this list)
Suitable for read-intensive applications that perform long, sequential streaming reads
Has three core components: the NameNode, the DataNode, and the SecondaryNameNode
NameNode is the nervous system of Hadoop. It keeps track of the file names, their permissions, and the locations (which nodes of the cluster) of each data block of each file. This information is stored in metadata file(s). It is the single point of failure for the Hadoop system: if we lose the NameNode, everything comes to a standstill. Typically, it is run on a separate node with lots of RAM, as it stores the metadata for each of the files in HDFS in memory. This is the reason we ran the NameNode on the hadoop-master node
By keeping the data block size large (64MB or more) in HDFS, the NameNode is able to save on the amount of space needed to store the metadata for each file. Also, this results in the data in a chunk being placed sequentially, next to each other, on the physical disk (note that the physical disk block size is much smaller - 4K or 8K), which allows for fast streaming reads with optimal IO seeks
The NameNode metadata consists of two sets of files: the FSImage file and the EditLog files. The FSImage file contains a snapshot of the filesystem at the time the NameNode was started. The EditLog spans multiple files and contains the sequence of changes made to the filesystem after the NameNode started
DataNode is where the data blocks of the files in HDFS reside. It is the workhorse of HDFS and performs all the read and write operations. It constantly communicates with the NameNode to report the list of data blocks it is storing. In addition, it also communicates with the other DataNodes in the cluster for data block replication. Typically, it runs on the slave nodes (like hadoop-slave-1 and hadoop-slave-2 of our cluster)
Uses a master-slave architecture with the NameNode as the master and the DataNodes as the slaves. As a master the NameNode manages the file system namespace, the metadata about the files in HDFS and regulates access to the files in HDFS. On the other hand, as a slave a DataNode manages the storage attached to the node on which it is running and is responsible for block creation, deletion, and replication upon instruction from the master
SecondaryNameNode is responsible for periodically taking a snapshot of the metadata file(s) from the NameNode, merging them, and loading the merged version back to the NameNode. In other words, the changes in the EditLog files are merged into the FSImage file, which is then uploaded back to the NameNode. Since the NameNode is rarely restarted in a real production cluster, the EditLog files can grow big, hence the need to merge them. The name is a bit misleading - it is really not a backup for the NameNode. Typically, it is run on a separate node, which is why we ran it on the hadoop-master node
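Coming back to the replication factor mentioned above, here is a quick, hedged illustration (the local file name ./sample.txt and the HDFS path /data are made up for this sketch): the replication factor can be overridden for a single request with the generic -D option, or changed for an already stored file with the -setrep operation (both appear in the hadoop fs help output later in this article):
hadoop fs -D dfs.replication=2 -put ./sample.txt /data
hadoop fs -setrep -w 3 /data/sample.txt
The first command stores the file with a replication factor of 2 for just that request; the second raises the replication factor of the stored file to 3 and waits (-w) for the re-replication to complete.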
Hadoop MapReduce
Is a programming model targeted at processing large data sets in parallel by dividing a unit of work into a set of independent tasks that can be executed on the slave nodes of the cluster
The unit of work is divided into two phases: a Map phase and a Reduce phase
During the Map phase, the input data is divided into input splits for processing by the Map tasks running in parallel on the various nodes of the cluster. In other words, the Map task performs filtering and transformation on the input data
During the Reduce phase, the output from the Map tasks is processed by the Reduce tasks running in parallel on the various nodes of the cluster. In other words, the Reduce task performs aggregation on the output data from the Map tasks
Has two core components: the JobTracker and the TaskTracker
When a unit of work (also known as a Job) is submitted by a client for processing, it goes through the JobTracker. The JobTracker first consults with the NameNode to figure out the locations (nodes) of the data blocks for the input files to be processed and then assigns the Map and Reduce tasks to the nodes that have the data blocks. The JobTracker monitors the progress and re-submits any failed tasks. Like the NameNode, it is a single point of failure in the Hadoop system: if we lose the JobTracker, no jobs can be executed. Typically, it is run on a separate node, which is why we ran the JobTracker on the hadoop-master node (an example job submission appears right after this list).
The TaskTracker is the workhorse of the Hadoop MapReduce system and executes the Map and Reduce tasks assigned by the JobTracker. It constantly communicates with the JobTracker to report status of the task(s) it is executing. Typically, it runs on the slave nodes (like hadoop-slave-1 and hadoop-slave-2 of our cluster)
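To make the Map and Reduce phases described above concrete, here is a hedged sketch of submitting the word count example that ships with the Hadoop 1.x distribution (the jar name assumes the Hadoop 1.2.1 install from Part-1, and the HDFS input/output paths are made up for this example):
hadoop jar $HADOOP_PREFIX/hadoop-examples-1.2.1.jar wordcount /data/input /data/output
hadoop fs -cat /data/output/part*
The Map tasks tokenize the lines of the input files into words, while the Reduce tasks sum up the counts for each word; the second command displays the result file(s) written by the Reduce task(s).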
Hands-on with Hadoop HDFS
In the following paragraphs we will continue to explore HDFS commands using the hadoop shell.
To list all the commands supported by the hadoop shell, execute the following command:
hadoop help
The following will be the output:
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  oiv                  apply the offline fsimage viewer to an fsimage
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl>  copy file or directories recursively
  distcp2 <srcurl> <desturl>  DistCp version 2
  archive -archiveName NAME -p <parent path> <src>* <dest>  create a hadoop archive
  classpath            prints the class path needed to get the Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
To use HDFS, we will mostly use the fs command.
To list all the operations supported by the hadoop fs command, execute the following command:
hadoop fs -help
The following will be the output:
hadoop fs is the command to execute fs commands. The full syntax is:

hadoop fs [-fs <local | file system URI>] [-conf <configuration file>]
 [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>]
 [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm [-skipTrash] <src>]
 [-rmr [-skipTrash] <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>]
 [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>
 [-getmerge <src> <localdst> [addnl]] [-cat <src>]
 [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>]
 [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>]
 [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>]
 [-tail [-f] <path>] [-text <path>]
 [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
 [-chown [-R] [OWNER][:[GROUP]] PATH...]
 [-chgrp [-R] GROUP PATH...]
 [-count[-q] <path>]
 [-help [cmd]]

-fs [local | <file system URI>]: Specify the file system to use. If not specified, the current configuration is used, taken from the following, in increasing precedence: core-default.xml inside the hadoop jar file, core-site.xml in $HADOOP_CONF_DIR. 'local' means use the local file system as your DFS. <file system URI> specifies a particular file system to contact. This argument is optional but if used must appear appear first on the command line. Exactly one additional argument must be specified.
-ls <path>: List the contents that match the specified file pattern. If path is not specified, the contents of /user/<currentUser> will be listed. Directory entries are of the form dirName (full path) <dir> and file entries are of the form fileName(full path) <r n> size where n is the number of replicas specified for the file and size is the size of the file, in bytes.
-lsr <path>: Recursively list the contents that match the specified file pattern. Behaves very similarly to hadoop fs -ls, except that the data is shown for all the entries in the subtree.
-du <path>: Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the unix command "du -sb <path>/*" in case of a directory, and to "du -b <path>" in case of a file. The output is in the form name(full path) size (in bytes)
-dus <path>: Show the amount of space, in bytes, used by the files that match the specified file pattern. Equivalent to the unix command "du -sb". The output is in the form name(full path) size (in bytes)
-mv <src> <dst>: Move files that match the specified file pattern <src> to a destination <dst>. When moving multiple files, the destination must be a directory.
-cp <src> <dst>: Copy files that match the file pattern <src> to a destination. When copying multiple files, the destination must be a directory.
-rm [-skipTrash] <src>: Delete all files that match the specified file pattern. Equivalent to the Unix command "rm <src>". The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-rmr [-skipTrash] <src>: Remove all directories which match the specified file pattern. Equivalent to the Unix command "rm -rf <src>". The -skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-put <localsrc> ... <dst>: Copy files from the local file system into fs.
-copyFromLocal <localsrc> ... <dst>: Identical to the -put command.
-moveFromLocal <localsrc> ... <dst>: Same as -put, except that the source is deleted after it's copied.
-get [-ignoreCrc] [-crc] <src> <localdst>: Copy files that match the file pattern <src> to the local name. <src> is kept. When copying mutiple, files, the destination must be a directory.
-getmerge <src> <localdst>: Get all the files in the directories that match the source file pattern and merge and sort them to only one file on local fs. <src> is kept.
-cat <src>: Fetch all files that match the file pattern <src> and display their content on stdout.
-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>: Identical to the -get command.
-moveToLocal <src> <localdst>: Not implemented yet
-mkdir <path>: Create a directory in specified location.
-setrep [-R] [-w] <rep> <path/file>: Set the replication level of a file. The -R flag requests a recursive change of replication level for an entire tree.
-tail [-f] <file>: Show the last 1KB of the file. The -f option shows apended data as the file grows.
-touchz <path>: Write a timestamp in yyyy-MM-dd HH:mm:ss format in a file at <path>. An error is returned if the file exists with non-zero length
-test -[ezd] <path>: If file { exists, has zero length, is a directory then return 0, else return 1.
-text <src>: Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
-stat [format] <path>: Print statistics about the file/directory at <path> in the specified format. Format accepts filesize in blocks (%b), filename (%n), block size (%o), replication (%r), modification date (%y, %Y)
-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...: Changes permissions of a file. This works similar to shell's chmod with a few exceptions. -R modifies the files recursively. This is the only option currently supported. MODE is same as mode used for chmod shell command. Only letters recognized are 'rwxX'. E.g. a+r,g-w,+rwx,o=r. OCTALMODE is specifed in 3 digits. Unlike shell command, this requires all three digits. E.g. 754 is same as u=rwx,g=rx,o=r. If none of 'augo' is specified, 'a' is assumed and unlike shell command, no umask is applied.
-chown [-R] [OWNER][:[GROUP]] PATH...: Changes owner and group of a file. This is similar to shell's chown with a few exceptions. -R modifies the files recursively. This is the only option currently supported. If only owner or group is specified then only owner or group is modified. The owner and group names may only cosists of digits, alphabet, and any of '-_.@/' i.e. [-_.@/a-zA-Z0-9]. The names are case sensitive. WARNING: Avoid using '.' to separate user name and group though Linux allows it. If user names have dots in them and you are using local file system, you might see surprising results since shell command 'chown' is used for local files.
-chgrp [-R] GROUP PATH...: This is equivalent to -chown ... :GROUP ...
-count[-q] <path>: Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns are: DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME or QUOTA REMAINING_QUATA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
-help [cmd]: Displays help for given command or all commands if none is specified.
To list all the file(s) and directories under the root of HDFS issue the following command:
hadoop fs -ls /
The following will be the output:
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2013-12-25 18:26 /home
Let us create a directory called data in the root of HDFS. To do that issue the following command:
hadoop fs -mkdir /data
This command does not generate any output.
Re-issuing the ls command will generate the following output:
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2013-12-26 19:28 /data
drwxr-xr-x - hadoop supergroup 0 2013-12-25 18:26 /home
Let us create a sample file in the local Linux file system to be stored in HDFS. We will create a file with a listing of all the jars under $HADOOP_PREFIX/lib and save it in Downloads/Hadoop-Jars.txt.
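As a hedged sketch (any text file would do; this is just one way to generate such a listing), the file could be created as follows:
ls -l $HADOOP_PREFIX/lib/*.jar > ./Downloads/Hadoop-Jars.txt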
To store this file in HDFS, issue the following command:
hadoop fs -put ./Downloads/Hadoop-Jars.txt /data
This command does not generate any output.
Now issue the following command to list all the file(s) under the /data directory of HDFS:
hadoop fs -ls /data
The following will be the output:
Found 1 items
-rw-r--r-- 2 hadoop supergroup 3934 2013-12-26 19:33 /data/Hadoop-Jars.txt
The number in the second column from the listing above for the file /data/Hadoop-Jars.txt indicates the replication factor.
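One can also query the replication factor of a file directly using the -stat operation with the %r format specifier described in the help output above, for example:
hadoop fs -stat %r /data/Hadoop-Jars.txt
Similarly, %o prints the block size and %n the file name.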
To recursively list all the file(s) under the root directory of HDFS, issue the following command:
hadoop fs -lsr /
The following will be the output:
drwxr-xr-x - hadoop supergroup 0 2013-12-26 19:33 /data
-rw-r--r-- 2 hadoop supergroup 3934 2013-12-26 19:33 /data/Hadoop-Jars.txt
drwxr-xr-x - hadoop supergroup 0 2013-12-25 18:26 /home
drwxr-xr-x - hadoop supergroup 0 2013-12-25 18:26 /home/hadoop
drwxr-xr-x - hadoop supergroup 0 2013-12-25 18:26 /home/hadoop/hadoop-data
drwxr-xr-x - hadoop supergroup 0 2013-12-26 12:48 /home/hadoop/hadoop-data/mapred
drwx------ - hadoop supergroup 0 2013-12-26 12:48 /home/hadoop/hadoop-data/mapred/system
-rw------- 2 hadoop supergroup 4 2013-12-26 12:48 /home/hadoop/hadoop-data/mapred/system/jobtracker.info
To display the contents of the file /data/Hadoop-Jars.txt from HDFS, issue the following command:
hadoop fs -cat /data/Hadoop-Jars.txt
The following will be the output:
-rw-rw-r-- 1 hadoop hadoop 43398 Jul 22 18:26 hadoop-1.2.1/lib/asm-3.2.jar
-rw-rw-r-- 1 hadoop hadoop 116219 Jul 22 18:26 hadoop-1.2.1/lib/aspectjrt-1.6.11.jar
-rw-rw-r-- 1 hadoop hadoop 8918431 Jul 22 18:26 hadoop-1.2.1/lib/aspectjtools-1.6.11.jar
-rw-rw-r-- 1 hadoop hadoop 188671 Jul 22 18:26 hadoop-1.2.1/lib/commons-beanutils-1.7.0.jar
-rw-rw-r-- 1 hadoop hadoop 206035 Jul 22 18:26 hadoop-1.2.1/lib/commons-beanutils-core-1.8.0.jar
-rw-rw-r-- 1 hadoop hadoop 41123 Jul 22 18:26 hadoop-1.2.1/lib/commons-cli-1.2.jar
-rw-rw-r-- 1 hadoop hadoop 58160 Jul 22 18:26 hadoop-1.2.1/lib/commons-codec-1.4.jar
-rw-rw-r-- 1 hadoop hadoop 575389 Jul 22 18:26 hadoop-1.2.1/lib/commons-collections-3.2.1.jar
-rw-rw-r-- 1 hadoop hadoop 298829 Jul 22 18:26 hadoop-1.2.1/lib/commons-configuration-1.6.jar
-rw-rw-r-- 1 hadoop hadoop 13619 Jul 22 18:26 hadoop-1.2.1/lib/commons-daemon-1.0.1.jar
-rw-rw-r-- 1 hadoop hadoop 143602 Jul 22 18:26 hadoop-1.2.1/lib/commons-digester-1.8.jar
-rw-rw-r-- 1 hadoop hadoop 112341 Jul 22 18:26 hadoop-1.2.1/lib/commons-el-1.0.jar
-rw-rw-r-- 1 hadoop hadoop 279781 Jul 22 18:26 hadoop-1.2.1/lib/commons-httpclient-3.0.1.jar
-rw-rw-r-- 1 hadoop hadoop 163151 Jul 22 18:26 hadoop-1.2.1/lib/commons-io-2.1.jar
-rw-rw-r-- 1 hadoop hadoop 261809 Jul 22 18:26 hadoop-1.2.1/lib/commons-lang-2.4.jar
-rw-rw-r-- 1 hadoop hadoop 60686 Jul 22 18:26 hadoop-1.2.1/lib/commons-logging-1.1.1.jar
-rw-rw-r-- 1 hadoop hadoop 26202 Jul 22 18:26 hadoop-1.2.1/lib/commons-logging-api-1.0.4.jar
-rw-rw-r-- 1 hadoop hadoop 832410 Jul 22 18:26 hadoop-1.2.1/lib/commons-math-2.1.jar
-rw-rw-r-- 1 hadoop hadoop 273370 Jul 22 18:26 hadoop-1.2.1/lib/commons-net-3.1.jar
-rw-rw-r-- 1 hadoop hadoop 3566844 Jul 22 18:26 hadoop-1.2.1/lib/core-3.1.1.jar
-rw-rw-r-- 1 hadoop hadoop 58461 Jul 22 18:26 hadoop-1.2.1/lib/hadoop-capacity-scheduler-1.2.1.jar
-rw-rw-r-- 1 hadoop hadoop 70409 Jul 22 18:26 hadoop-1.2.1/lib/hadoop-fairscheduler-1.2.1.jar
-rw-rw-r-- 1 hadoop hadoop 10443 Jul 22 18:26 hadoop-1.2.1/lib/hadoop-thriftfs-1.2.1.jar
-rw-rw-r-- 1 hadoop hadoop 706710 Jul 22 18:26 hadoop-1.2.1/lib/hsqldb-1.8.0.10.jar
-rw-rw-r-- 1 hadoop hadoop 227500 Jul 22 18:26 hadoop-1.2.1/lib/jackson-core-asl-1.8.8.jar
-rw-rw-r-- 1 hadoop hadoop 668564 Jul 22 18:26 hadoop-1.2.1/lib/jackson-mapper-asl-1.8.8.jar
-rw-rw-r-- 1 hadoop hadoop 405086 Jul 22 18:26 hadoop-1.2.1/lib/jasper-compiler-5.5.12.jar
-rw-rw-r-- 1 hadoop hadoop 76698 Jul 22 18:26 hadoop-1.2.1/lib/jasper-runtime-5.5.12.jar
-rw-rw-r-- 1 hadoop hadoop 220920 Jul 22 18:26 hadoop-1.2.1/lib/jdeb-0.8.jar
-rw-rw-r-- 1 hadoop hadoop 458233 Jul 22 18:26 hadoop-1.2.1/lib/jersey-core-1.8.jar
-rw-rw-r-- 1 hadoop hadoop 147933 Jul 22 18:26 hadoop-1.2.1/lib/jersey-json-1.8.jar
-rw-rw-r-- 1 hadoop hadoop 694352 Jul 22 18:26 hadoop-1.2.1/lib/jersey-server-1.8.jar
-rw-rw-r-- 1 hadoop hadoop 321806 Jul 22 18:26 hadoop-1.2.1/lib/jets3t-0.6.1.jar
-rw-rw-r-- 1 hadoop hadoop 539912 Jul 22 18:26 hadoop-1.2.1/lib/jetty-6.1.26.jar
-rw-rw-r-- 1 hadoop hadoop 177131 Jul 22 18:26 hadoop-1.2.1/lib/jetty-util-6.1.26.jar
-rw-rw-r-- 1 hadoop hadoop 185746 Jul 22 18:26 hadoop-1.2.1/lib/jsch-0.1.42.jar
-rw-rw-r-- 1 hadoop hadoop 198945 Jul 22 18:26 hadoop-1.2.1/lib/junit-4.5.jar
-rw-rw-r-- 1 hadoop hadoop 11428 Jul 22 18:26 hadoop-1.2.1/lib/kfs-0.2.2.jar
-rw-rw-r-- 1 hadoop hadoop 391834 Jul 22 18:26 hadoop-1.2.1/lib/log4j-1.2.15.jar
-rw-rw-r-- 1 hadoop hadoop 1419869 Jul 22 18:26 hadoop-1.2.1/lib/mockito-all-1.8.5.jar
-rw-rw-r-- 1 hadoop hadoop 65261 Jul 22 18:26 hadoop-1.2.1/lib/oro-2.0.8.jar
-rw-rw-r-- 1 hadoop hadoop 134133 Jul 22 18:26 hadoop-1.2.1/lib/servlet-api-2.5-20081211.jar
-rw-rw-r-- 1 hadoop hadoop 15345 Jul 22 18:26 hadoop-1.2.1/lib/slf4j-api-1.4.3.jar
-rw-rw-r-- 1 hadoop hadoop 8601 Jul 22 18:26 hadoop-1.2.1/lib/slf4j-log4j12-1.4.3.jar
-rw-rw-r-- 1 hadoop hadoop 15010 Jul 22 18:26 hadoop-1.2.1/lib/xmlenc-0.52.jar
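To copy the file from HDFS back to the local file system, one can use the -get (or the equivalent -copyToLocal) operation; for example (the local destination path below is arbitrary):
hadoop fs -get /data/Hadoop-Jars.txt ./Downloads/Hadoop-Jars-copy.txt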
To check the health of HDFS and list information on the data blocks of the file /data/Hadoop-Jars.txt and the nodes to which they have been replicated, issue the following command:
hadoop fsck /data/Hadoop-Jars.txt -files -blocks -locations
The following will be the output:
FSCK started by hadoop from /127.0.0.1 for path /data/Hadoop-Jars.txt at Thu Dec 26 20:08:08 EST 2013
/data/Hadoop-Jars.txt 3934 bytes, 1 block(s): OK
0. blk_3898821203252733167_1006 len=3934 repl=2 [192.168.1.16:50010, 192.168.1.12:50010]
Status: HEALTHY
 Total size: 3934 B
 Total dirs: 0
 Total files: 1
 Total blocks (validated): 1 (avg. block size 3934 B)
 Minimally replicated blocks: 1 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 2
 Average block replication: 2.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 2
 Number of racks: 1
FSCK ended at Thu Dec 26 20:08:08 EST 2013 in 1 milliseconds

The filesystem under path '/data/Hadoop-Jars.txt' is HEALTHY
From the line repl=2 [192.168.1.16:50010, 192.168.1.12:50010], we can see that the data block has been replicated to the two slaves hadoop-slave-1 (192.168.1.12) and hadoop-slave-2 (192.168.1.16).
To display the amount of space used (in bytes) by the files and directories in HDFS, issue the following command:
hadoop fs -du /
The following will be the output:
Found 2 items
3934 hdfs://hadoop-master:54310/data
4 hdfs://hadoop-master:54310/home
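The related -dus operation prints a single summary line for the specified path instead of one line per entry; for example:
hadoop fs -dus /data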
To delete the file /data/Hadoop-Jars.txt, issue the following command:
hadoop fs -rm /data/Hadoop-Jars.txt
The following will be the output:
Deleted hdfs://hadoop-master:54310/data/Hadoop-Jars.txt
One can get information about the NameNode by typing in the following URL in a web browser:
http://hadoop-master:50070
The following diagram in Figure-6 shows the result:
One can also browse HDFS by clicking on Browse the filesystem as shown in Figure-6 above. The following is the result of clicking the link:
We have barely scratched the surface of HDFS !!!
References