Sundaramurthy Blog

March 19, 2010

Hadoop Basics

Filed under: Hadoop — sundar5 @ 6:19 am

I. What is Hadoop?

Hadoop is a Java-based programming framework and distributed file system for processing large data sets in a distributed computing environment. It spreads a large data set across multiple hosts in a cluster, so it does not require RAID storage on the individual hosts. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

The Hadoop framework consists of:

1. Hadoop Distributed File System (HDFS)

2. MapReduce

Hadoop Distributed File System: HDFS breaks a large data set into smaller blocks (64 MB by default) and distributes them across the local storage of the nodes in the cluster.

MapReduce: MapReduce moves the data processing work close to the data, i.e. each piece of the data set is processed locally on the node that stores it. This frees developers to focus on application logic instead of data access and parallelism logic.

HDFS architecture
A typical Hadoop cluster follows a master/slave design.

The master server controls all activity in the cluster, while the slave (data) nodes do the work assigned by the master.

In a Hadoop environment, the master node is called the NameNode and each slave node is called a DataNode. The cluster therefore has two kinds of servers:

1. NameNode server
2. DataNode server

NameNode: The NameNode manages the file system metadata and provides control services for the Hadoop cluster. Only one NameNode process runs per HDFS file system in the cluster.

BackupNode: The NameNode is a single point of failure in an HDFS environment. To mitigate this, a backup node (the SecondaryNameNode daemon started in the installation below) copies the file system metadata from the NameNode at frequent intervals.

DataNode: The DataNode stores data blocks and serves read and write requests for them. Multiple DataNode processes run in a cluster, typically one DataNode process per storage node.
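
To make the NameNode/DataNode roles concrete, here is a minimal sketch (not from the original post) that writes and reads a small file through the standard org.apache.hadoop.fs.FileSystem Java client; the class name and file path are only illustrative. The client contacts the NameNode for metadata and block locations, while the file contents are streamed to and from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name and related settings from /etc/hadoop/conf.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Example path only; any HDFS path the user can write to will do.
    Path path = new Path("/tmp/hello.txt");

    // Write: the NameNode records the metadata and picks DataNodes,
    // and the bytes are streamed to those DataNodes.
    FSDataOutputStream out = fs.create(path, true);
    out.writeUTF("hello hdfs");
    out.close();

    // Read: the client asks the NameNode for block locations,
    // then reads the block contents from a DataNode.
    FSDataInputStream in = fs.open(path);
    System.out.println(in.readUTF());
    in.close();
  }
}

With the daemons from the installation section running, this would be compiled against the Hadoop jars and launched with the hadoop command.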

JobTracker: The JobTracker accepts job submissions and distributes and controls the jobs across the cluster.
The distributed work is handed over to the TaskTracker process on each DataNode.

TaskTracker: The TaskTracker manages the execution of the individual map and reduce tasks on its DataNode.

MapReduce:

A MapReduce job splits the input data into independent chunks, runs map tasks on the nodes that store each chunk, and then combines the map outputs with reduce tasks to produce the final result.


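As a concrete illustration (not from the original post), below is a minimal sketch of the classic WordCount job, assuming the Hadoop 0.20 Java MapReduce API. The map tasks run next to the HDFS blocks and emit (word, 1) pairs; the reduce tasks sum the counts for each word. Input and output directories are taken from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs close to each input split and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count"); // Job.getInstance() arrived in later releases
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner runs the reducer logic map-side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // existing input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, it would typically be launched with something like: hadoop jar wordcount.jar WordCount <input dir> <output dir>.
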
II. Hadoop Installation:

Installing Hadoop on a single local machine, with all of the daemons running on one node.

1. Find and install packages

$yum search hadoop
$yum install hadoop-0.20 -y

2. Start the Hadoop daemons.

# sudo /etc/init.d/hadoop-0.20-datanode start
Starting Hadoop datanode daemon (hadoop-datanode): starting datanode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hadoop-datanode-datamart001.corp.ck1.abcd.com.out
[  OK  ]
# sudo /etc/init.d/hadoop-0.20-jobtracker start
Starting Hadoop jobtracker daemon (hadoop-jobtracker): starting jobtracker, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hadoop-jobtracker-datamart001.corp.ck1.abcd.com.out
[  OK  ]
#  sudo /etc/init.d/hadoop-0.20-namenode start
Starting Hadoop namenode daemon (hadoop-namenode): starting namenode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hadoop-namenode-datamart001.corp.ck1.abcd.com.out
[  OK  ]
# sudo /etc/init.d/hadoop-0.20-secondarynamenode start
Starting Hadoop secondarynamenode daemon (hadoop-secondarynamenode): starting secondarynamenode, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hadoop-secondarynamenode-datamart001.corp.ck1.abcd.com.out
[  OK  ]
# sudo /etc/init.d/hadoop-0.20-tasktracker start
Starting Hadoop tasktracker daemon (hadoop-tasktracker): starting tasktracker, logging to /usr/lib/hadoop-0.20/bin/../logs/hadoop-hadoop-tasktracker-datamart001.corp.ck1.abcd.com.out

3. Stop the Hadoop daemons.

# sudo /etc/init.d/hadoop-0.20-datanode stop

Stopping Hadoop datanode daemon (hadoop-datanode): stopping datanode
[  OK  ]

# sudo /etc/init.d/hadoop-0.20-jobtracker stop

Stopping Hadoop jobtracker daemon (hadoop-jobtracker): stopping jobtracker
[  OK  ]

#  sudo /etc/init.d/hadoop-0.20-namenode stop
Stopping Hadoop namenode daemon (hadoop-namenode): stopping namenode
[  OK  ]

# sudo /etc/init.d/hadoop-0.20-secondarynamenode stop
Stopping Hadoop secondarynamenode daemon (hadoop-secondarynamenode): stopping secondarynamenode
[  OK  ]

# sudo /etc/init.d/hadoop-0.20-tasktracker stop
Stopping Hadoop tasktracker daemon (hadoop-tasktracker): stopping tasktracker
[  OK  ]

4. Monitor the NameNode from its web console at http://datamart001.corp.ck1.abcd.com:50070/


5. View the JobTracker from its web console at http://datamart001.corp.ck1.abcd.com:50030



I. NameNode directory structure. The Hadoop configuration file below shows where the NameNode's directory structure lives (dfs.name.dir); this is the directory formatted by 'hadoop namenode -format'.

# cat /etc/hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <!-- specify this so that running 'hadoop namenode -format' formats the right dir -->
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name</value>
  </property>
</configuration>

/var/lib/hadoop-0.20/cache/hadoop/dfs/name/current
# ls -ltrg
-rw-rw-r-- 1 hadoop  101 Dec 28 10:23 VERSION
-rw-rw-r-- 1 hadoop    8 Dec 28 10:23 fstime
-rw-rw-r-- 1 hadoop 2556 Dec 28 10:23 fsimage
-rw-rw-r-- 1 hadoop    4 Dec 28 10:23 edits
# cat VERSION
#Sun Dec 28 10:23:12 PDT 2009
namespaceID=1581900343
cTime=0
storageType=NAME_NODE
layoutVersion=-18

II. DataNode directory structure.

/var/lib/hadoop-0.20/cache/hadoop/dfs/data/current
# ls -ltrg
-rw-rw-r-- 1 hadoop    31 Dec 19 10:51 blk_-8775109834191680755_1008.meta
-rw-rw-r-- 1 hadoop  3032 Dec 19 10:51 blk_-8775109834191680755
-rw-rw-r-- 1 hadoop    11 Dec 19 10:51 blk_7264600866087375563_1010.meta
-rw-rw-r-- 1 hadoop   496 Dec 19 10:51 blk_7264600866087375563
-rw-rw-r-- 1 hadoop    43 Dec 19 10:51 blk_-7135936775282395850_1009.meta
-rw-rw-r-- 1 hadoop  4190 Dec 19 10:51 blk_-7135936775282395850
-rw-rw-r-- 1 hadoop    11 Dec 19 10:51 blk_-6839088824963056379_1011.meta
-rw-rw-r-- 1 hadoop   213 Dec 19 10:51 blk_-6839088824963056379
-rw-rw-r-- 1 hadoop    11 Dec 19 10:51 blk_-5791348377539587198_1007.meta
-rw-rw-r-- 1 hadoop   338 Dec 19 10:51 blk_-5791348377539587198
-rw-rw-r-- 1 hadoop    39 Dec 19 10:51 blk_1560465537198859072_1006.meta
-rw-rw-r-- 1 hadoop  3936 Dec 19 10:51 blk_1560465537198859072
-rw-rw-r-- 1 hadoop   143 Dec 19 10:52 blk_-2000923735053060286_1022.meta
-rw-rw-r-- 1 hadoop 17068 Dec 19 10:52 blk_-2000923735053060286
-rw-rw-r-- 1 hadoop    11 Dec 19 10:53 blk_4157843498084069450_1023.meta
-rw-rw-r-- 1 hadoop    62 Dec 19 10:53 blk_4157843498084069450
-rw-rw-r-- 1 hadoop    67 Dec 19 10:53 blk_3729274841206521562_1023.meta
-rw-rw-r-- 1 hadoop  7588 Dec 19 10:53 blk_3729274841206521562
-rw-rw-r-- 1 hadoop   158 Dec 26 12:16 VERSION
-rw-rw-r-- 1 hadoop    11 Dec 26 12:16 blk_8306638076952770016_1025.meta
-rw-rw-r-- 1 hadoop     4 Dec 26 12:16 blk_8306638076952770016
-rw-rw-r-- 1 hadoop    11 Dec 27 11:20 blk_6997061879063474330_1026.meta
-rw-rw-r-- 1 hadoop    26 Dec 27 11:20 blk_6997061879063474330
-rw-rw-r-- 1 hadoop  2215 Dec 27 12:35 dncp_block_verification.log.curr

# cat VERSION
#Fri Mar 26 12:16:07 PDT 2009
namespaceID=1581900343
storageID=DS-418830811-10.72.148.142-50010-1268948984374
cTime=0
storageType=DATA_NODE
layoutVersion=-18

III. Namesecondary directory structure.

/var/lib/hadoop-0.20/cache/hadoop/dfs/namesecondary/current
# ls -ltrg
-rw-rw-r-- 1 hadoop  101 Mar 28 10:23 VERSION
-rw-rw-r-- 1 hadoop    8 Mar 28 10:23 fstime
-rw-rw-r-- 1 hadoop 2556 Mar 28 10:23 fsimage
-rw-rw-r-- 1 hadoop    4 Mar 28 10:23 edits
# cat VERSION
#Mon Dec 28 10:23:12 PDT 2009
namespaceID=1581900343
cTime=0
storageType=NAME_NODE
layoutVersion=-18

to be continued……