Commit d3b07e1

Author: Thomas Weise
Improved Hadoop Docu
1 parent b9b70dd commit d3b07e1

2 files changed: +21 -2 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ We now focus on how the computing power of massive clusters can be utilized for
#### 2.4.2. MapReduce with Hadoop: Large-Scale Data Processing
100-
In a final step we discuss a technology which combines the ability to create large-scale distributed computations (from the MPI world) with the rich tool support of the Java ecosystem: [MapReduce](https://en.wikipedia.org/wiki/MapReduce) with [Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop). MPI is the technology of choice if communication is expensive and the bottleneck of our application, frequent communication is required between processes solving related sub-problems, the available hardware is homogenous, processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance, the size of data that needs to be transmitted is smaller in comparison to runtime of computations, and when we do not need to worry much about exchanging daa with a heterogeneous distributed application environment. Hadoop, on the other hand, covers use cases where communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning), when the environment is heterogeneous, processes do not need to be organized in a special way and the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces, where sub-problems have batch job character, where data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or where data comes from and results must be pushed back to other applications in the environment, say to HTTP/Java Servlet/Web Service stacks. Our [Hadoop examples](http://github.com/thomasWeise/distributedComputingExamples/tree/master/hadoop/) focus on the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI, just for the scenario described above).
100+
In a final step we discuss a technology which combines the ability to create large-scale distributed computations (from the MPI world) with the rich tool support of the Java ecosystem: [MapReduce](https://en.wikipedia.org/wiki/MapReduce) with [Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop). MPI is the technology of choice if communication is expensive and the bottleneck of our application, frequent communication is required between processes solving related sub-problems, the available hardware is homogenous, processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance, the size of data that needs to be transmitted is smaller in comparison to runtime of computations, and when we do not need to worry much about exchanging data with a heterogeneous distributed application environment. Hadoop, on the other hand, covers use cases where communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning), when the environment is heterogeneous, processes do not need to be organized in a special way and the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces, where sub-problems have batch job character, where data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or where data comes from and results must be pushed back to other applications in the environment, say to HTTP/Java Servlet/Web Service stacks. Our [Hadoop examples](http://github.com/thomasWeise/distributedComputingExamples/tree/master/hadoop/) focus on the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI, just for the scenario described above).
### 2.5. Summary

hadoop/README.md

Lines changed: 20 additions & 1 deletion
@@ -1,6 +1,25 @@
# Hadoop Examples
3-
[Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) is a framework for performing large-scale distributed computations in a cluster. Different from [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (which we discussed [here](http://github.com/thomasWeise/distributedComputingExamples/tree/master/mpi/)), it is based on Java technologies. It may thus be slower than MPI, but can reap the full benefit of the rich libraries and programming environment of the Java ecosystem (just think about all the things we already did in this course). One of the most well-known ways to use Hadoop is to perform computations following the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI).
3+
[Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop) is a framework for performing large-scale distributed computations in a cluster. Different from [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) (which we discussed [here](http://github.com/thomasWeise/distributedComputingExamples/tree/master/mpi/)), it is based on Java technologies. It may thus be slower than MPI, but can reap the full benefit of the rich libraries and programming environment of the Java ecosystem (just think about all the things we already did in this course). One of the most well-known ways to use Hadoop is to perform computations following the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI).
4+
5+
Let us now shortly compare use cases of MPI versus MapReduce with Hadoop. MPI is the technology of choice if
6+
7+
- communication is expensive and the bottleneck of our application,
8+
- frequent communication is required between processes solving related sub-problems,
9+
- the available hardware is homogenous (and we can use an MPI implementation optimized for it),
10+
- processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance,
11+
- the size of data that needs to be transmitted is smaller in comparison to runtime of computations, and when
12+
- we do not need to worry much about exchanging data with a heterogeneous distributed application environment.
13+
14+
Hadoop, on the other hand, covers use cases where
15+
16+
- communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning), when
17+
- the environment is heterogeneous,
18+
- processes do not need to be organized in a special way and
19+
- the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces, where
20+
- sub-problems have batch job character, where
21+
- data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or where
22+
- data comes from and results must be pushed back to other applications in the environment, say to HTTP/Java Servlet/Web Service stacks.
## 1. Examples
