Hadoop interview questions you may run into: how many can you answer?

Hadoop official documentation

Purpose

This document serves as a tutorial and comprehensively describes all user-facing facets of the Hadoop MapReduce framework.

Questions you might be asked in a Hadoop interview: how many can you answer?

MapReduce Tutorial

Prerequisites

Ensure that Hadoop is installed, configured and running. More details:

       Single Node Setup for first-time users.

       Cluster Setup for large, distributed clusters.

1. How does Hadoop work?

Purpose

Overview

Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and HDFS run on the same set of nodes. This configuration allows the framework to schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and one MRAppMaster per application.

Minimally, applications specify the input/output locations and supply the map and reduce functions by implementing the appropriate interfaces and/or abstract classes. These, together with other job parameters, comprise the job configuration.

The Hadoop job client submits the job and its configuration to the ResourceManager, which then takes on the responsibility of distributing the software/configuration to the slave nodes, scheduling the tasks, monitoring them, and providing status and diagnostic information back to the client.

Although the MapReduce framework is implemented in Java, MapReduce applications need not be written in Java.

Hadoop Streaming is a utility that allows users to create and run jobs with any executables as the mapper and/or the reducer.

Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not based on JNI).

2. How does MapReduce work?

Prerequisites

Inputs and Outputs

The MapReduce framework operates exclusively on key/value pairs: it treats the input to the job as a set of key/value pairs and produces a set of key/value pairs, conceivably of different types, as the output.

The key and value classes must be serializable by the framework and therefore need to implement the Writable interface. In addition, the key classes must implement WritableComparable so that the framework can sort them.

Input and output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

3. How does HDFS store data?

Overview

MapReduce – User Interfaces

This section presents the user-facing aspects of MapReduce in reasonable detail. It should help users implement, configure and tune their jobs in a fine-grained manner. However, please refer to the Javadoc for the comprehensive usage of each class and interface; this is only meant to be a tutorial.

Let us first look at the Mapper and Reducer interfaces. Applications typically implement them to provide the map and reduce methods.

We will then discuss the other interfaces, including Job, Partitioner, InputFormat and others.

Finally, we will wrap up by discussing some useful features of the framework, such as the DistributedCache, the IsolationRunner and so on.

4. Give a simple example of how MapReduce runs.

Inputs and Outputs

Payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
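As a rough illustration of this contract, here is a minimal word-count sketch against the org.apache.hadoop.mapreduce API; the class names are illustrative and not part of the original tutorial.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}
```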

5. The interviewer gives you some problems and asks you to solve them with MapReduce.

MapReduce – User Interfaces

Mapper

A Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or more output pairs.

The MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Overall, the Mapper implementation class is set on the Job via Job.setMapperClass(Class); the framework then calls its map method for each key/value pair in the InputSplit for that task. Applications can override the cleanup method to perform any required clean-up.

Output pairs do not need to be of the same types as input pairs. A given input pair may map to zero or more output pairs. Output pairs are collected with calls to context.write(WritableComparable, Writable).

Applications can use Counters to report statistics.

All intermediate values associated with a given key are sorted by key by the framework and then passed to a particular Reducer to produce the final output. Users can control the grouping via Job.setGroupingComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The number of partitions equals the number of reduce tasks. Users can control which key goes to which Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner via Job.setCombinerClass(Class) to perform local aggregation of the map outputs, which effectively reduces the amount of data transferred from the Mapper to the Reducer.

The sorted intermediate records are usually stored in a simple (key-len, key, value-len, value) format. Applications can control through the configuration whether, and how, the intermediate outputs are compressed and which codec is used.
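The calls mentioned above (Job.setMapperClass, Job.setCombinerClass and so on) are normally wired together in a small driver class. A minimal sketch, reusing the hypothetical WordCountMapper and WordCountReducer from the earlier example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // local aggregation on the map side
        job.setReducerClass(WordCountReducer.class);   // final aggregation

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```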

     
For example: there are 10 folders, each containing 1,000,000 URLs. Find the top 1,000,000 URLs.

Payload

**How Many Maps?**

The number of maps is usually driven by the total size of the input, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set as high as 300 for very CPU-light map tasks. Task setup takes a while, so it is best if each map takes at least a minute to run.

Thus, if you have 10 TB of input data and a block size of 128 MB, you will end up with about 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
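A quick back-of-the-envelope check of the figure quoted above (10 TB of input at a 128 MB block size):

```java
public class MapCountEstimate {
    public static void main(String[] args) {
        long inputBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of input
        long blockSize  = 128L * 1024 * 1024;              // 128 MB HDFS block size
        System.out.println(inputBytes / blockSize);        // 81920, i.e. roughly 82,000 map tasks
    }
}
```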

6. What is the role of the Combiner in Hadoop?

  • Mapper
  • Reducer
  • Partitioner
  • Counter

下边是原版的书文


Src:

Job Configuration

Purpose

This document comprehensively describes all user-facing facets of the
Hadoop MapReduce framework and serves as a tutorial.

**Q1. Name the most common InputFormats defined in Hadoop? Which one is the default?**

The following are the most common InputFormats defined in Hadoop:

– TextInputFormat (the default)

– KeyValueInputFormat

– SequenceFileInputFormat

Task Execution & Environment

Prerequisites

Ensure that Hadoop is installed, configured and is running. More
details:

    Single Node Setup for first-time users.

    Cluster Setup for large, distributed clusters.


  • Memory Management
  • Map Parameters
  • Shuffle/Reduce Parameters
  • Configured Parameters
  • Task Logs
  • Distributing Libraries

Overview

Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets)
in-parallel on large clusters (thousands of nodes) of commodity hardware
in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one slave NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java, MapReduce
applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.

Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI based).


Job Submission and Monitoring

Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)


Q2. What is the difference between the TextInputFormat and KeyValueInputFormat classes?

TextInputFormat: it reads lines of text files and provides the byte offset of the line as the key and the actual line as the value to the Mapper.

KeyValueInputFormat: it reads text files and parses lines into key/value pairs. Everything up to the first tab character is sent as the key to the Mapper and the remainder of the line is sent as the value.
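A minimal sketch of switching a job from the default TextInputFormat to KeyValueTextInputFormat (the new-API counterpart of KeyValueInputFormat). The separator property name shown is the Hadoop 2.x one, and tab is already the default separator, so setting it is only for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property name; tab is the default separator anyway.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "key-value input example");
        // Default TextInputFormat: key = byte offset of the line, value = the line itself.
        // KeyValueTextInputFormat: key = text before the separator, value = the rest of the line.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    }
}
```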

*Q3. What is InputSplit in Hadoop*

When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process. Each such chunk is called an InputSplit.

*Q4. How is the splitting of a file invoked in the Hadoop framework?*

It is invoked by the Hadoop framework by calling the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.

Q5. Consider case scenario: In M/R system,

    – HDFS block size is 64 MB

    – Input format is FileInputFormat

    – We have 3 files of size 64K, 65Mb and 127Mb 

*then how many input splits will be made by Hadoop framework?*

Hadoop will make 5 splits, as follows:

– 1 split for the 64K file

– 2 splits for the 65MB file

– 2 splits for the 127MB file

*Q6. What is the purpose of RecordReader in Hadoop*

The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

*Q7. After the Map phase finishes, the hadoop framework does
“Partitioning, Shuffle and sort”. Explain what happens in this phase?*

– Partitioning

Partitioning is the process of determining which reducer instance will
receive which intermediate keys and values. Each mapper must determine
for all of its output (key, value) pairs which reducer will receive
them. It is necessary that for any key, regardless of which mapper
instance generated it, the destination partition is the same

– Shuffle

After the first map tasks have completed, the nodes may still be
performing several more map tasks each. But they also begin exchanging
the intermediate outputs from the map tasks to where they are required
by the reducers. This process of moving map outputs to the reducers is
known as shuffling.

– Sort

Each reduce task is responsible for reducing the values associated with
several intermediate keys. The set of intermediate keys on a single node
is automatically sorted by Hadoop before they are presented to the
Reducer 

*Q9. If no custom partitioner is defined in Hadoop, then how is data partitioned before it is sent to the reducer?*

The default partitioner computes a hash value for the key and assigns
the partition based on this result 
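For reference, the default HashPartitioner behaves roughly like the sketch below; a custom Partitioner would override getPartition in the same way:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what the default HashPartitioner does for a Text/IntWritable job.
public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then take it modulo #reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```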

Q10. What is a Combiner 

The Combiner is a “mini-reduce” process which operates only on data
generated by a mapper. The Combiner will receive as input all data
emitted by the Mapper instances on a given node. The output from the
Combiner is then sent to the Reducers, instead of the output from the
Mappers.

Q11. Give an example scenario where a combiner can be used and where it cannot be used.

There can be several examples; the following are the most common ones:

– Scenario where you can use combiner

  Getting list of distinct words in a file

– Scenario where you cannot use a combiner

  Calculating mean of a list of numbers 
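A tiny plain-Java sketch, with made-up numbers, of why a sum-style combiner is safe but a naive mean-style combiner is not:

```java
public class CombinerPitfall {
    public static void main(String[] args) {
        // Values for one key, split across two (hypothetical) mappers:
        // mapper 1 sees: 1, 2, 3   -> local mean = 2.0
        // mapper 2 sees: 10, 20    -> local mean = 15.0
        double meanOfMeans = (2.0 + 15.0) / 2;          // 8.5  -> wrong
        double trueMean = (1 + 2 + 3 + 10 + 20) / 5.0;  // 7.2  -> correct

        System.out.println(meanOfMeans + " vs " + trueMean);
        // A sum (or count) combiner is safe because addition is associative and
        // commutative; a plain "mean" combiner loses the per-mapper counts.
        // (Carrying (sum, count) pairs through the combiner would fix this.)
    }
}
```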

Q12. What is job tracker

Job Tracker is the service within Hadoop that runs Map Reduce jobs
on the cluster

Q13. What are some typical functions of Job Tracker

The following are some typical tasks of Job Tracker

– Accepts jobs from clients

– It talks to the NameNode to determine the location of the data

– It locates TaskTracker nodes with available slots at or near the data

– It submits the work to the chosen Task Tracker nodes and monitors
progress of each task by receiving heartbeat signals from Task tracker 

Q14. What is task tracker

Task Tracker is a node in the cluster that accepts tasks such as Map, Reduce and Shuffle operations from a JobTracker.



*Q15. Whats the relationship between Jobs and Tasks in Hadoop*

One job is broken down into one or many tasks in Hadoop

*Q16. Suppose Hadoop spawned 100 tasks for a job and one of the
task failed. What will
hadoop do ?*

It will restart the task again on some other task tracker and only if
the task fails more than 4 (default setting and can be changed) times
will it kill the job
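The retry limit mentioned above is configurable per job. A minimal sketch, assuming the standard Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryLimits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property names; both default to 4 attempts per task.
        conf.setInt("mapreduce.map.maxattempts", 4);
        conf.setInt("mapreduce.reduce.maxattempts", 4);

        Job job = Job.getInstance(conf, "retry-limit example");
        // ... set mapper, reducer and input/output paths as usual ...
    }
}
```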

*Q17. Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?*

Speculative Execution 

*Q18. How does speculative execution work in Hadoop?*

Job tracker makes different task trackers process same input. When tasks
complete, they announce this fact to the Job Tracker. Whichever copy of
a task finishes first becomes the definitive copy. If other copies were
executing speculatively, Hadoop tells the Task Trackers to abandon
the tasks and discard their outputs. The Reducers then receive their
inputs from whichever Mapper completed successfully, first. 
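Speculative execution can be toggled per job. A minimal sketch, assuming the standard Hadoop 2.x property names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationToggle {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property names; speculative execution is enabled by default.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "speculation example");
    }
}
```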

Q19. Using command line in Linux, how will you 

*- see all jobs running in the hadoop cluster*

– kill a job

– hadoop job -list

– hadoop job -kill jobid 

*Q20. What is Hadoop Streaming?*

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

**Q21. What is the characteristic of the Streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?**

Hadoop Streaming allows the use of arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

*Q22. What is the Distributed Cache in Hadoop?*

The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
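A minimal sketch of shipping a file through the distributed cache with the newer Job-based API; the HDFS path is hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache example");
        // Hypothetical HDFS path: the file is copied to every node before tasks start.
        job.addCacheFile(new URI("/user/hadoop/lookup.txt"));
        // Inside a task, the cached files can be listed via context.getCacheFiles()
        // and then opened as ordinary local files.
    }
}
```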

Q23. What is the benefit of the Distributed Cache? Why can't we just keep the file in HDFS and have the application read it?

This is because the distributed cache is much faster. It copies the file to all task trackers at the start of the job. If a task tracker then runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache. On the other hand, if the MR job reads the file from HDFS directly, every mapper will access it from HDFS, so a task tracker running 100 map tasks will read the file 100 times from HDFS. HDFS is also not very efficient when used in this way.

*Q24. What mechanism does the Hadoop framework provide to synchronize changes made to the Distributed Cache during the runtime of the application?*

This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution.

*Q25. Have you ever used Counters in Hadoop. Give us an example
scenario*

Anybody who claims to have worked on a Hadoop project is expected to
use counters

*Q26. Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?*

Yes. The InputFormat class provides methods to add multiple directories as input to a Hadoop job.
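A minimal sketch of adding several input directories to one job; the paths are hypothetical, and MultipleInputs.addInputPath can be used instead when each input needs its own format or mapper:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MultipleInputDirs {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input example");
        // Hypothetical directories; each call adds one more input path to the same job.
        FileInputFormat.addInputPath(job, new Path("/data/logs/2014-01"));
        FileInputFormat.addInputPath(job, new Path("/data/logs/2014-02"));
        // For a different InputFormat or Mapper per input, use
        // org.apache.hadoop.mapreduce.lib.input.MultipleInputs.addInputPath instead.
    }
}
```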

*Q27. Is it possible to have Hadoop job output in multiple directories? If yes, then how?*

Yes, by using the MultipleOutputs class.
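A rough sketch of a reducer writing to multiple named outputs with the MultipleOutputs class; the output names and the threshold are hypothetical, and the driver is assumed to register them with MultipleOutputs.addNamedOutput:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultiDirReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // "big" and "small" are hypothetical named outputs registered in the driver, e.g.
        // MultipleOutputs.addNamedOutput(job, "big", TextOutputFormat.class, Text.class, IntWritable.class);
        // The write(namedOutput, key, value, baseOutputPath) overload can target subdirectories.
        mos.write(sum > 100 ? "big" : "small", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```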

*Q28. What will a hadoop job do if you try to run it with an output directory that is already present? Will it

  • overwrite it
  • warn you and continue
  • throw an exception and exit*

The hadoop job will throw an exception and exit.

*Q29. How can you set an arbitrary number of mappers to be created for a job in Hadoop?*

This is a trick question. You cannot set it.

*Q30. How can you set an arbitrary number of reducers to be created for a job in Hadoop?*

You can either do it programmatically by calling the setNumReduceTasks method on the JobConf (or Job) class, or set it as a configuration property.
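A minimal sketch of both ways to set the reducer count (the property name shown is the Hadoop 2.x one):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent configuration property (assumed Hadoop 2.x name).
        conf.setInt("mapreduce.job.reduces", 10);

        Job job = Job.getInstance(conf, "reducer-count example");
        job.setNumReduceTasks(10);  // programmatic way; JobConf offers the same call in the old API
    }
}
```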

Src:

  • Job Control

MapReduce – User Interfaces

This section provides a reasonable amount of detail on every user-facing
aspect of the MapReduce framework. This should help users implement,
configure and tune their jobs in a fine-grained manner. However, please
note that the java doc for each class/interface remains the most
comprehensive documentation available; this is only meant to be a
tutorial.

Let us first take the Mapper and Reducer interfaces. Applications
typically implement them to provide the map and reduce methods.

We will then discuss other core interfaces including Job, Partitioner, InputFormat, OutputFormat, and others.

Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.

Job Input

Payload

Applications typically implement the Mapper and Reducer interfaces to
provide the map and reduce methods. These form the core of the job.

  • InputSplit
  • RecordReader

Mapper

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs.

The Hadoop MapReduce framework spawns one map task for each InputSplit
generated by the InputFormat for the job.

Overall, Mapper implementations are passed the Job for the job via the Job.setMapperClass(Class) method. The framework then calls map(WritableComparable, Writable, Context) for each key/value pair in the InputSplit for that task. Applications can then override the cleanup(Context) method to perform any required cleanup.

Output pairs do not need to be of the same types as input pairs. A given
input pair may map to zero or many output pairs. Output pairs are
collected with calls to context.write(WritableComparable, Writable).

Applications can use the Counter to report its statistics.

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The
total number of partitions is the same as the number of reduce tasks for
the job. Users can control which keys (and hence records) go to which
Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format. Applications can control if, and how, the intermediate outputs are to be compressed and the CompressionCodec to be used via the Configuration.
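A minimal sketch of enabling map-output compression through the Configuration, assuming the Hadoop 2.x property names and that the Snappy codec is available on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompression {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed Hadoop 2.x property names: compress the intermediate (map-side) output.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-shuffle example");
    }
}
```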

Job Output

**How Many Maps?**

The number of maps is usually driven by the total size of the inputs,
that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps
per-node, although it has been set up to 300 maps for very cpu-light map
tasks. Task setup takes a while, so it is best if the maps take at least
a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128 MB, you'll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

Please excuse, and feel free to point out, any mistakes caused by my limited translation skill.

  • OutputCommitter
  • Task Side-Effect Files
  • RecordWriter

Other Useful Features

  • Submitting Jobs to Queues
  • Counters
  • DistributedCache
  • Profiling
  • Debugging
  • Data Compression
  • Skipping Bad Records

Purpose

This document comprehensively describes all user-facing facets of the
Hadoop MapReduce framework and serves as a tutorial.


Prerequisites

Ensure that Hadoop is installed, configured and is running. More
details:

    • Single Node
      Setup for
      first-time users.
    • Cluster
      Setup for
      large, distributed clusters.


Overview

Hadoop MapReduce is a software framework for easily writing applications
which process vast amounts of data (multi-terabyte data-sets)
in-parallel on large clusters (thousands of nodes) of commodity hardware
in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent
chunks which are processed by the map tasks in a completely parallel
manner. The framework sorts the outputs of the maps, which are then
input to the reduce tasks. Typically both the input and the output of
the job are stored in a file-system. The framework takes care of
scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is,
the MapReduce framework and the Hadoop Distributed File System
(see HDFS Architecture
Guide)
are running on the same set of nodes. This configuration allows the
framework to effectively schedule tasks on the nodes where data is
already present, resulting in very high aggregate bandwidth across the
cluster.

The MapReduce framework consists of a single master ResourceManager, one
slave NodeManager per cluster-node, and MRAppMaster per application
(see YARN Architecture
Guide).

Minimally, applications specify the input/output locations and
supply map and reduce functions via implementations of appropriate
interfaces and/or abstract-classes. These, and other job parameters,
comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and
configuration to the ResourceManager which then assumes the
responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, providing status and diagnostic
information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce
applications need not be written in Java.

    • Hadoop
      Streaming is
      a utility which allows users to create and run jobs with any
      executables (e.g. shell utilities) as the mapper and/or the
      reducer.
    • Hadoop
      Pipes is
      a SWIG-compatible C++ API to implement
      MapReduce applications (non JNI™ based).


Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and
hence need to implement
the Writable interface.
Additionally, the key classes have to implement
the WritableComparable interface
to facilitate sorting by the framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2>
-> combine -> <k2, v2> -> reduce -> <k3,
v3> (output)


MapReduce – User Interfaces

This section provides a reasonable amount of detail on every user-facing
aspect of the MapReduce framework. This should help users implement,
configure and tune their jobs in a fine-grained manner. However, please
note that the javadoc for each class/interface remains the most
comprehensive documentation available; this is only meant to be a
tutorial.

Let us first take the Mapper and Reducer interfaces. Applications
typically implement them to provide the map and reduce methods.

We will then discuss other core interfaces
including Job, Partitioner, InputFormat, OutputFormat, and others.

Finally, we will wrap up by discussing some useful features of the
framework such as the DistributedCache, IsolationRunner etc.


Payload

Applications typically implement the Mapper and Reducer interfaces to
provide the map and reduce methods. These form the core of the job.


Mapper

Mapper maps
input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into
intermediate records. The transformed intermediate records do not need
to be of the same type as the input records. A given input pair may map
to zero or many output pairs.

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

Overall, Mapper implementations are passed the Job for the job via
the Job.setMapperClass(Class) method.
The framework then calls map(WritableComparable, Writable,
Context) for
each key/value pair in the InputSplit for that task. Applications can
then override the cleanup(Context) method to perform any required
cleanup.

Output pairs do not need to be of the same types as input pairs. A given
input pair may map to zero or many output pairs. Output pairs are
collected with calls to context.write(WritableComparable, Writable).

Applications can use the Counter to report its statistics.

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output. Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).

The Mapper outputs are sorted and then partitioned per Reducer. The
total number of partitions is the same as the number of reduce tasks for
the job. Users can control which keys (and hence records) go to
which Reducer by implementing a custom Partitioner.

Users can optionally specify a combiner,
via Job.setCombinerClass(Class),
to perform local aggregation of the intermediate outputs, which helps to
cut down the amount of data transferred from the Mapper to the Reducer.

The intermediate, sorted outputs are always stored in a simple (key-len,
key, value-len, value) format. Applications can control if, and how, the
intermediate outputs are to be compressed and
the CompressionCodec to
be used via the Configuration.


  How Many Maps?

The number of maps is usually driven by the total size of the inputs,
that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around
10-100 maps per-node, although it has been set up to 300 maps for very
cpu-light map tasks. Task setup takes a while, so it is best if the maps
take at least a minute to execute.

Thus, if you expect 10TB of input data and have a blocksize of 128MB,
you’ll end up with 82,000 maps, unless
Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a
hint to the framework) is used to set it even higher.


 

Reducer

Reducer reduces
a set of intermediate values which share a key to a smaller set of
values.

The number of reduces for the job is set by the user
via Job.setNumReduceTasks(int).

Overall, Reducer implementations are passed the Job for the job via the Job.setReducerClass(Class) method and can override it to initialize themselves. The framework then calls the reduce(WritableComparable, Iterable<Writable>, Context) method for each <key, (list of values)> pair in the grouped inputs. Applications can then override the cleanup(Context) method to perform any required cleanup.

Reducer has 3 primary phases: shuffle, sort and reduce.


 

Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.


 

Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.

The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

