Data Engineering

A community for discussion about data engineering

Spark vs Presto: A Comprehensive Comparison (www.analyticsvidhya.com)

submitted 1 year ago by ericjmorey to c/data_engineering

On December 28, 2023, Pankaj Singh writes:

In big data processing and analytics, choosing the right tool is paramount for efficiently extracting meaningful insights from vast datasets. Two popular frameworks that have gained significant traction in the industry are Apache Spark and Presto. Both are designed to handle large-scale data processing efficiently, yet they have distinct features and use cases. As organizations grapple with the complexities of handling massive volumes of data, a comprehensive understanding of Spark and Presto’s nuances and distinctive features becomes essential. In this article, we will compare Spark vs Presto, exploring their performance and scalability, data processing capabilities, ecosystem, integration, and use cases and applications.

Read Spark vs Presto: A Comprehensive Comparison

[–] [email protected] 1 points 1 year ago (1 children)

The article requires registration with Google or by email. Without registration there is a paywall.

[–] ericjmorey 1 points 1 year ago (1 children)

Strange. I was able to view the article without either. I also tried in a private browser on both mobile and desktop.

[–] [email protected] 1 points 1 year ago (1 children)

I'm getting this paywall, and I can't find any way to skip it.

screenshot

[–] ericjmorey 2 points 1 year ago (1 children)

I don't know what's bypassing it on my setups, but it's on the Wayback Machine.

https://web.archive.org/web/2/https://www.analyticsvidhya.com/blog/2023/12/spark-vs-presto-a-comprehensive-comparison/

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago) (1 children)

Thank you! The conclusion is quite good (use Spark for ETL and Presto (Trino) for analytical queries), but the article looks very outdated.

Spark is not about RDDs. Today most Spark usage goes through the DataFrame API, and that is not just a matter of syntax. The Catalyst optimizer itself provides a lot of performance optimizations, like predicate pushdown at the level of ORC/Parquet reading, automatic skew join detection, pruning, etc.

Also, Presto in this case should be called Trino, because there was a rebranding in 2020.
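
To make the predicate pushdown point above a bit more concrete, here is a toy sketch in plain Python. It is not Spark or Catalyst code. The `RowGroup` class, the two scan functions and the sample data are all hypothetical names made up for illustration. The only idea it models is that columnar formats such as Parquet and ORC keep min/max statistics per chunk of rows, so a reader that knows the filter can skip whole chunks instead of reading everything and filtering afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class RowGroup:
    """Hypothetical stand-in for one chunk of a columnar file.

    Real formats store min/max statistics in file metadata; here they are
    simply computed once when the object is created.
    """
    rows: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        self.min_value = min(self.rows)
        self.max_value = max(self.rows)


def scan_without_pushdown(row_groups, threshold):
    """Read every row group, then keep the values greater than threshold."""
    groups_read = 0
    result = []
    for group in row_groups:
        groups_read += 1
        result.extend(v for v in group.rows if v > threshold)
    return result, groups_read


def scan_with_pushdown(row_groups, threshold):
    """Skip row groups whose statistics prove that no value can match."""
    groups_read = 0
    result = []
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # the filter cannot match anything in this chunk
        groups_read += 1
        result.extend(v for v in group.rows if v > threshold)
    return result, groups_read


# Made-up sample data: three chunks of one integer column.
ROW_GROUPS = [
    RowGroup([1, 5, 9]),
    RowGroup([12, 15, 18]),
    RowGroup([21, 25, 30]),
]

if __name__ == "__main__":
    print(scan_without_pushdown(ROW_GROUPS, 20))
    print(scan_with_pushdown(ROW_GROUPS, 20))
```

Both functions return the same values, but the second one touches fewer chunks. In Spark itself you don't write any of this by hand. Calling `explain()` on a DataFrame that reads Parquet with a filter should show the pushed-down predicates in the physical plan.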

[–] ericjmorey 1 points 1 year ago

I was questioning the quality of the source. Thanks for confirming that it's not a top-quality article.