-In a final step we discuss a technology which combines the ability to create large-scale distributed computations (from the MPI world) with the rich tool support of the Java ecosystem: [MapReduce](https://en.wikipedia.org/wiki/MapReduce) with [Apache](http://hadoop.apache.org/) [Hadoop](https://en.wikipedia.org/wiki/Apache_Hadoop). MPI is the technology of choice if communication is expensive and the bottleneck of our application, frequent communication is required between processes solving related sub-problems, the available hardware is homogenous, processes need to be organized in groups or topological structures to make efficient use of collective communication to achieve high performance, the size of data that needs to be transmitted is smaller in comparison to runtime of computations, and when we do not need to worry much about exchanging daa with a heterogeneous distributed application environment. Hadoop, on the other hand, covers use cases where communication is not the bottleneck, because computation takes much longer than communication (think Machine Learning), when the environment is heterogeneous, processes do not need to be organized in a special way and the division of tasks into sub-problems can be done efficiently by just slicing the input data into equal-sized pieces, where sub-problems have batch job character, where data is unstructured (e.g., text) and potentially huge (eating away the advantages of MPI-style communication), or where data comes from and results must be pushed back to other applications in the environment, say to HTTP/Java Servlet/Web Service stacks. Our [Hadoop examples](http://github.com/thomasWeise/distributedComputingExamples/tree/master/hadoop/) focus on the [MapReduce](https://en.wikipedia.org/wiki/MapReduce) pattern (which is a tiny little bit similar to scatter/gather/reduce in MPI, just for the scenario described above).
0 commit comments