



Introduction to Spark

Before touching Spark, several important terms need to be explained.

Let’s start with the Hadoop Distributed File System (HDFS).

HDFS consists of 2 parts: the namenode (master) and the datanodes (slaves). The namenode manages the metadata, while the datanodes are where the real data is saved. Metadata is data that describes other data, for example which blocks make up a file and which datanodes hold them. Clients’ data is stored on multiple datanodes for safety, so that losing a single node does not lose the data.
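To make the split between metadata and real data concrete, here is a minimal sketch in plain Python. It is not HDFS code: the function `place_blocks`, the block and datanode names, the replication factor of 3, and the round-robin placement are all assumptions made for illustration (a real cluster uses a more careful placement policy).

```python
def place_blocks(blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Returns namenode-style metadata: block id -> list of datanode names.
    The blocks themselves would live on the datanodes, not in this mapping.
    """
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the requested replication")
    metadata = {}
    for i, block in enumerate(blocks):
        # Pick `replication` consecutive datanodes, wrapping around the list.
        metadata[block] = [
            datanodes[(i + r) % len(datanodes)] for r in range(replication)
        ]
    return metadata


# Hypothetical block and node names.
metadata = place_blocks(["blk_0", "blk_1"], ["dn1", "dn2", "dn3", "dn4"])
print(metadata)
```

Because every block has several copies, any single datanode can fail and each block is still available on another node.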

Second, let’s talk about MapReduce.

There are 2 common computation strategies in the big data world: batch processing and stream processing. Batch processing works on data that has already been stored, while stream processing works on data that is received continuously. To be more specific, log analysis and web page analysis are typical examples of batch processing, and capturing users’ clicks on web pages as they happen is an example of stream processing.
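The difference can be sketched in a few lines of plain Python. The function names `batch_count` and `stream_count` and the click data are made up for this illustration; the point is only that the batch version sees the whole data set at once, while the streaming version updates its result event by event.

```python
from collections import Counter


def batch_count(stored_clicks):
    """Batch: all data is already stored, so count it in one pass."""
    return Counter(stored_clicks)


def stream_count(click_stream):
    """Streaming: update a running result as each event arrives."""
    counts = Counter()
    for page in click_stream:
        counts[page] += 1
        yield dict(counts)  # a snapshot is available after every event


clicks = ["home", "cart", "home"]  # hypothetical click events
batch_result = batch_count(clicks)
stream_snapshots = list(stream_count(iter(clicks)))
print(batch_result)
print(stream_snapshots)
```

Once the stream has delivered the same events, its last snapshot matches the batch result.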

MapReduce is a classic model for batch processing, because it divides a problem into pieces and computes those pieces in parallel.

Let’s take a look at a simple example and see how MapReduce works.

Suppose we have a list of billions of words, and we want to count each word’s occurrence. What’s the best way to do it?

First, label each word with itself as the key and set its value to 1. (Map step)

Second, the framework automatically groups the words by their labels and sends each group to the reduce section. (Shuffle step, which is hidden from the programmer)

Third, count the occurrences in each group and merge the data to get the final results. (Reduce step)
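The three steps can be sketched on a single machine in plain Python. This is only an illustration of the idea, not Hadoop code: the function names `map_step`, `shuffle_step` and `reduce_step` are made up here, and in a real cluster each step would run in parallel on many nodes.

```python
from collections import defaultdict


def map_step(words):
    """Map: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in words]


def shuffle_step(pairs):
    """Shuffle: group the values by key (done by the framework in practice)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(groups):
    """Reduce: sum the values of each group to get the count per word."""
    return {key: sum(values) for key, values in groups.items()}


words = ["spark", "hadoop", "spark", "hdfs", "spark"]
word_counts = reduce_step(shuffle_step(map_step(words)))
print(word_counts)
```

Each group can be reduced independently of the others, which is what makes the reduce step easy to run in parallel.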

The second example is TeraSort (sorting billions of records).

First, label each word according to its first character, e.g. the label of the word ‘student’ is ‘s’. The labels are ordered, so all items in group i+1 are greater than all items in group i. A trie can be used to label words quickly.

Second, the framework automatically groups the data by label and sends each group to the reduce section. (Shuffle)

Third, sort each group and combine the groups in label order. Since the groups themselves are already in ascending order, the sorted groups only need to be concatenated, not merged.
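Here is a minimal single-machine sketch of that idea in plain Python. It assumes non-empty words and uses the first character directly as the label instead of a trie; the names `label` and `tera_sort` are made up for this illustration.

```python
from collections import defaultdict


def label(word):
    """Map: use the first character as the group label (word must be non-empty)."""
    return word[0]


def tera_sort(words):
    # Shuffle: group the words by their label.
    groups = defaultdict(list)
    for word in words:
        groups[label(word)].append(word)
    # Reduce: sort each group, then concatenate the groups in label order.
    result = []
    for key in sorted(groups):
        result.extend(sorted(groups[key]))
    return result


words = ["student", "apple", "spark", "banana", "sort", "ant"]
sorted_words = tera_sort(words)
print(sorted_words)
```

Each group can be sorted on a different machine, and no cross-group comparison is needed afterwards because every word in one group is smaller than every word in the next.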

Finally, let’s turn to Spark.

Spark builds on the ideas of MapReduce, but unlike MapReduce, its engine works mainly in memory and avoids using disk where it can. Compared with MapReduce, it has several advantages:

Speed: Spark is claimed to be up to 100 times faster than MapReduce in memory and up to 10 times faster on disk

Dependency: Spark can work on its own, while MapReduce requires Hadoop

Easy to use: thanks to its high-level APIs, Spark is much easier to use

The RDD (Resilient Distributed Dataset) is a key concept in the Spark ecosystem. It is an immutable collection of objects that can be operated on in parallel. Why is it resilient? Basically, lineage information (the record of how an RDD was derived) is saved in memory, so lost data can be quickly recreated from the parent RDD. This greatly increases fault tolerance. Each RDD consists of a number of partitions. The more partitions there are, the more work can be done in parallel. There are 2 types of operations on an RDD: transformations and actions. A transformation creates a new RDD from an existing one; an action runs a calculation and returns values to the driver. Transformations are “lazy”, because nothing is computed until you call an action.
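Laziness and lineage can be illustrated with a toy class in plain Python. `ToyRDD` is a hypothetical, single-machine stand-in written for this post; it is not Spark’s API and has no partitions or parallelism. It only shows that transformations record what to do, while the action actually does it.

```python
class ToyRDD:
    """A hypothetical stand-in for an RDD (not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable input data
        self._lineage = tuple(lineage)  # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: return a new ToyRDD and compute nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: return a new ToyRDD and compute nothing.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replay the lineage over the source data and return values.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []  # records when square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
nothing_ran_yet = calls == []     # True: only the lineage was recorded
result = evens_squared.collect()  # the action triggers the computation
print(nothing_ran_yet, result)
```

Because `evens_squared` keeps its lineage, calling `collect()` again simply recomputes the result from the source data, which is the same idea that lets an RDD rebuild lost data from its parent.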

These days, people tend to use Spark SQL instead of working with RDDs directly. Spark SQL is built on top of RDDs. It is easier to use and usually more efficient than hand-written RDD code, because it optimizes the calculation process before running it.

A Dataset is a distributed collection of data. Dataset is an interface added in Spark 1.6 that combines the benefits of RDDs (strong typing, the ability to use lambda functions) with the benefits of Spark SQL’s optimized engine.

A DataFrame is a Dataset organized into named columns. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
