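June was an exciting month for Spark. At the Hadoop Summit in San Jose, Spark was a frequent topic of conversation and the subject of many presentations. On June 15, IBM also announced that it would make a massive investment in Spark-related technology.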
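This announcement helped kick off the Spark Summit in San Francisco, where one could see more and more engineers learning Spark and more and more companies experimenting with and adopting it.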
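Investment in Spark and adoption of it form a virtuous cycle that is rapidly driving the maturity and development of this important technology, to the benefit of the entire big data community. However, the growing attention to Spark has given some people a strange and stubborn misconception: that Spark can replace Hadoop rather than complement it. The misconception shows in news headlines such as “Companies Move On From Big Data Technology Hadoop.”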
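As a long-time big data practitioner and now chief executive of a big-data-as-a-service company, I would like to comment on this misconception and offer some clarification.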
Spark and Hadoop work well together.
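Hadoop is increasingly the enterprise platform of choice for companies processing big data. Spark is an in-memory processing solution that runs on top of Hadoop. Hadoop's largest users, including eBay and Yahoo, run Spark inside their own Hadoop clusters. Cloudera and Hortonworks also include Spark in their Hadoop distributions. Our customers at Altiscale have been using Hadoop with Spark running on it since we first launched.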
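Positioning Spark in opposition to Hadoop is like saying your new electric car is so cool that it does not need electricity at all. In reality, electric cars will drive demand for more electricity.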
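Why does this confusion arise? Today's Hadoop consists of two main parts. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data efficiently and at low cost and is optimized for the volume, variety and velocity of big data. The second is a computation engine called YARN, which can run massively parallel programs on the data stored in HDFS.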
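YARN can host any number of programming frameworks. The original framework was MapReduce, invented at Google to help process massive web-crawl data. Spark is another such framework, and there is also a new one called Tez. When people talk about a “showdown” between Spark and Hadoop, what they really mean is that programmers now prefer Spark to the earlier MapReduce framework.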
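However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways a Hadoop cluster can process data, and Spark can replace MapReduce. Business analysts avoid both of these low-level frameworks, which were meant for programmers. Instead, they use high-level languages such as SQL to work with Hadoop more easily.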
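Over the past four years, Hadoop-based big data technology has seen a dazzling pace of innovation. Hadoop has evolved from batch SQL to interactive operation, and from a single framework (MapReduce) to multiple frameworks (such as MapReduce and Spark).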
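HDFS has also seen enormous performance and security improvements, and many tools have emerged on top of these technologies, such as Datameer, H2O and Tableau. These tools greatly broaden the user base of big data infrastructure, making it usable by data scientists and business users as well.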
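Spark will not replace Hadoop. On the contrary, Hadoop is the foundation for Spark. As organizations seek the broadest and most robust platform for turning their data assets into actionable business insight, their adoption of both Hadoop and Spark will keep growing.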
English original:
June was an exciting month for Apache Spark. At Hadoop Summit San Jose, it was a frequent topic of conversation, as well as the subject of many session presentations. On June 15, IBM announced plans to make a massive investment in Spark-related technology.
This announcement helped kick off the Spark Summit in San Francisco, where one could witness the increasing number of engineers learning about Spark — and the increasing number of companies experimenting with and adopting Spark.
The virtuous cycle of Spark investment and adoption is rapidly driving the maturity and capabilities of this important technology, to the benefit of the entire big data community. However, the growing attention directed toward Spark has also given rise to a strange and stubborn misconception: that Spark is somehow an alternative to Apache Hadoop, instead of a complement to it. This misconception can be seen in headlines like “Newer Software Aims to Crunch Hadoop’s Numbers” and “Companies Move On From Big Data Technology Hadoop.”
As a long-time big data practitioner, an early advocate for investment in Hadoop by Yahoo! and now CEO of a company that provides big data as a service for the enterprise, I’d like to bring some perspective and clarity to this conversation.
Spark and Hadoop work together.
Hadoop is increasingly the enterprise platform of choice for big data. Spark is an in-memory processing solution that runs on top of Hadoop. The largest users of Hadoop, including eBay and Yahoo!, run Spark inside their Hadoop clusters. Cloudera and Hortonworks ship Spark as part of their Hadoop distributions. And our own customers here at Altiscale have been using Spark on Hadoop since we launched.
To position Spark in opposition to Hadoop is like saying that your new electric car is so cool that you won’t need electricity anymore. If anything, electric cars will drive demand for more electricity.
Why the confusion? Modern-day Hadoop consists of two main components. The first is a large-scale storage system called the Hadoop Distributed File System (HDFS), which stores data in a low-cost, high-performance manner optimized for the volume, variety and velocity of big data. The second component is a computation engine called YARN, which can run massively parallel programs on top of the data stored in HDFS.
YARN can host any number of programming frameworks. The original such framework was MapReduce, invented at Google to help process massive web crawls. Spark is another such framework, as is another new one called Tez. When people talk about Spark “crushing” Hadoop, what they really mean is that programmers now prefer using Spark to the older MapReduce framework.
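For readers unfamiliar with the MapReduce programming model mentioned above, the following is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It is an illustration only, not the Hadoop API: the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count`, and the sample input, are assumptions made for this example. On a real cluster the framework would spread these phases across many machines and read its input from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence (hypothetical driver)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Illustrative input; a real job would read files stored in HDFS.
    sample = ["spark runs on hadoop", "hadoop stores data in hdfs"]
    print(word_count(sample))
```

The point of the sketch is that the programmer supplies only the map and reduce functions, while the framework handles grouping and distribution. This is the layer at which frameworks such as MapReduce, Spark and Tez differ, while the storage and resource management underneath stay the same.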
However, MapReduce should not be equated with Hadoop. MapReduce is just one of many ways to process your data in a Hadoop cluster. Spark can be used as an alternative. Looking more broadly, business analysts — a growing base of big data practitioners — avoid both of these frameworks, which are low-level toolkits meant for programmers. Instead, they use high-level languages like SQL that make Hadoop more accessible.
In the last four years, Hadoop-based big data technology has seen an unprecedented level of innovation. We’ve gone from batch SQL to interactive; from one framework (MapReduce) to multiple frameworks (e.g., MapReduce, Spark and many others).
We’ve seen enormous performance and security improvements in HDFS, and we’ve seen an explosion of tools that sit on top of all of this, such as Datameer, H2O and Tableau, that make all of this big data infrastructure usable by a far broader range of data scientists and business users.
Spark isn’t a challenger that’s going to replace Hadoop. Rather, Hadoop is a foundation that makes Spark possible. We expect to see increasing adoption of both as organizations seek the broadest and most robust platform possible for turning their data assets into actionable business insight.
Translation: 1thinc0 (via TechCrunch)
End.