Spark and Hadoop Are Friends, Not Foes

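June was an exciting month for Spark. At the Hadoop Summit in San Jose, Spark was a frequent topic of conversation and the subject of many talks. On June 15, IBM also announced that it would make a massive investment in Spark-related technology.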

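This announcement helped kick off the Spark Summit in San Francisco, where one could see a growing number of engineers learning Spark and a growing number of companies experimenting with and adopting it.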

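Investment in Spark and adoption of Spark form a virtuous cycle that is rapidly driving the maturity and development of this important technology, to the benefit of the entire big data community. However, the growing attention to Spark has also given rise to a strange and stubborn misconception: that Spark can replace Hadoop rather than complement it. The misconception shows up in headlines such as "Companies Move On From Big Data Technology Hadoop."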

As a long-time big data practitioner and now the CEO of a big-data-as-a-service company, I would like to offer my view on this misconception and clear some things up.

Spark and Hadoop work well together.

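Hadoop is increasingly the enterprise platform of choice for processing big data. Spark is an in-memory processing solution that runs on top of Hadoop. Hadoop's largest users, including eBay and Yahoo, run Spark inside their own Hadoop clusters. Cloudera and Hortonworks also include Spark in their Hadoop distributions. Our customers at Altiscale have been using Hadoop with Spark running on it since we first launched.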

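Positioning Spark in opposition to Hadoop is like saying your new electric car is so cool that you no longer need electricity at all. In fact, electric cars will drive demand for more electricity.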

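Why the confusion? Today's Hadoop consists of two main parts. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data efficiently and at low cost and is optimized for the volume, variety, and velocity of big data. The second is a computation engine called YARN, which can run massively parallel programs over the data stored in HDFS.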

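YARN can host any number of programming frameworks. The original framework was MapReduce, invented at Google to help process massive web-crawl data. Spark is another such framework, and there is also a newer one called Tez. When people talk about a "showdown" between Spark and Hadoop, what they really mean is that programmers now prefer Spark to the older MapReduce framework.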
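To make the MapReduce programming model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is an illustration only: it does not use Hadoop, YARN, or Spark, and every function name in it (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) is hypothetical rather than part of any real framework's API.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# NOTE: all names below are hypothetical; this is a toy, in-memory
# illustration of the MapReduce model, not Hadoop's actual API.


def map_phase(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
) -> List[Tuple[str, int]]:
    """Run the mapper on every input record and collect the emitted pairs."""
    pairs: List[Tuple[str, int]] = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, the step a framework performs between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(
    groups: Dict[str, List[int]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Add up the counts emitted for one word."""
    return sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases into a complete word-count job."""
    return reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)


if __name__ == "__main__":
    sample = ["spark runs on hadoop", "hadoop stores data"]
    print(word_count(sample))
```

In a real cluster the map and reduce calls would run in parallel across many machines, with the shuffle moving data between them. The sketch keeps everything in one process so the shape of the model is easy to see.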

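However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways a Hadoop cluster can process data, and Spark can be used in its place. Business analysts avoid both frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages such as SQL, which make Hadoop easier to use.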
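The appeal of a high-level language is easiest to see with a small example. The sketch below expresses a word count as a single declarative SQL query. It uses Python's built-in `sqlite3` module purely as a stand-in so the example can run anywhere; it is an assumption made for illustration, not how a Hadoop cluster executes SQL, and the table name `words` and the function `word_counts_sql` are hypothetical.

```python
import sqlite3
from typing import Iterable, List, Tuple


def word_counts_sql(words: Iterable[str]) -> List[Tuple[str, int]]:
    """Count word occurrences with one SQL query.

    sqlite3 is only a local stand-in here (an assumption for illustration);
    the point is that the analyst states *what* to compute and leaves the
    *how* to the engine.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE words (word TEXT)")
        conn.executemany(
            "INSERT INTO words (word) VALUES (?)",
            [(w,) for w in words],
        )
        # One declarative statement replaces hand-written map and reduce steps.
        cursor = conn.execute(
            "SELECT word, COUNT(*) FROM words GROUP BY word ORDER BY word"
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(word_counts_sql(["spark", "hadoop", "spark"]))
```

The grouping and counting that a programmer would write by hand in a low-level framework are here handled by the query engine, which is why analysts tend to prefer this style.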

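Over the past four years, Hadoop-based big data technology has seen a dizzying amount of innovation. Hadoop has evolved from batch SQL to interactive queries, and from one framework (MapReduce) to multiple frameworks (such as MapReduce, Spark, and others).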

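HDFS has also seen enormous improvements in performance and security, and a wide range of tools has emerged on top of these technologies, such as Datameer, H2O, and Tableau. These tools greatly broaden the audience for big data infrastructure, making it usable by data scientists and business users as well.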

Spark will not replace Hadoop. Rather, Hadoop is the foundation for Spark. As organizations seek the broadest and most robust platform for turning their data assets into actionable business insight, they will increasingly adopt both Hadoop and Spark.

English original:

June was an exciting month for Apache Spark. At Hadoop Summit San Jose, it was a frequent topic of conversation, as well as the subject of many session presentations. On June 15, IBM announced plans to make a massive investment in Spark-related technology.

This announcement helped kick off the Spark Summit in San Francisco, where one could witness the increasing number of engineers learning about Spark — and the increasing number of companies experimenting with and adopting Spark.

The virtuous cycle of Spark investment and adoption is rapidly driving the maturity and capabilities of this important technology, to the benefit of the entire big data community. However, the growing attention directed toward Spark also has given rise to a strange and stubborn misconception: that Spark is somehow an alternative to Apache Hadoop, instead of a complement to it. This misconception can be seen in headlines like “Newer Software Aims to Crunch Hadoop’s Numbers” and “Companies Move On From Big Data Technology Hadoop.”

As a long-time big data practitioner, an early advocate for investment in Hadoop by Yahoo! and now CEO of a company that provides big data as a service for the enterprise, I’d like to bring some perspective and clarity to this conversation.

Spark and Hadoop work together.

Hadoop is increasingly the enterprise platform of choice for big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest users of Hadoop — including eBay and Yahoo! — both run Spark inside their Hadoop clusters. Cloudera and Hortonworks ship Spark as part of their Hadoop distributions. And our own customers here at Altiscale have been using Spark on Hadoop since we launched.

To position Spark in opposition to Hadoop is like saying that your new electric car is so cool that you won’t need electricity anymore. If anything, electric cars will drive demand for more electricity.

Why the confusion? Modern-day Hadoop consists of two main components. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data in a low-cost, high-performance manner optimized for the volume, variety and velocity of big data. The second component is a computation engine called YARN, which can run massively parallel programs on top of the data stored in HDFS.

YARN can host any number of programming frameworks. The original such framework was MapReduce, invented at Google to help process massive web crawls. Spark is another such framework, as is another new one called Tez. When people talk about Spark “crushing” Hadoop, what they really mean is that programmers now prefer using Spark to the older MapReduce framework.

However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways to process your data in a Hadoop cluster. Spark can be used as an alternative. Looking more broadly, business analysts — a growing base of big data practitioners — avoid both of these frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages like SQL that make Hadoop more accessible.

In the last four years, Hadoop-based big data technology has seen an unprecedented level of innovation. We’ve gone from batch SQL to interactive; from one framework (MapReduce) to multiple frameworks (e.g., MapReduce, Spark and many others).

We’ve seen enormous performance and security improvements in HDFS, and we’ve seen an explosion of tools that sit on top of all of this (such as Datameer, H2O and Tableau) that make all of this big data infrastructure usable by a far broader range of data scientists and business users.

Spark isn’t a challenger that’s going to replace Hadoop. Rather, Hadoop is a foundation that makes Spark possible. We expect to see increasing adoption of both as organizations seek the broadest and most robust platform possible for turning their data assets into actionable business insight.

Translation: 1thinc0. Via: TechCrunch


2015-07-15