Dremio is offering free pdf copies of "Apache Iceberg: The Definitive Guide: Data Lakehouse Functionality, Performance and Scalability on the Data Lake" (hello.dremio.com)

submitted 8 months ago* (last edited 8 months ago) by ericjmorey to c/data_engineering

Book Preface:

Welcome to Apache Iceberg: The Definitive Guide! We’re delighted you have embarked on this learning journey with us. In this preface, we provide an overview of this book, why we wrote it, and how you can make the most of it.

About This Book

In these pages, you’ll learn what Apache Iceberg is, why it exists, how it works, and how to harness its power. Designed for data engineers, architects, scientists, and analysts working with large datasets across various use cases from BI dashboards to AI/ML, this book explores the core concepts, inner workings, and practical applications of Apache Iceberg. By the time you reach the end, you will have grasped the essentials and possess the practical knowledge to implement Apache Iceberg effectively in your data projects. Whether you are a newcomer or an experienced practitioner, Apache Iceberg: The Definitive Guide will be your trusted companion on this enlightening journey into Apache Iceberg.

Why We Wrote This Book

As we observed the rapid growth and adoption of the Apache Iceberg ecosystem, it became evident that a growing knowledge gap needed to be addressed. Initially, we began by sharing insights through a series of blog posts on the Dremio platform to provide valuable information to the burgeoning Iceberg community. However, it soon became clear that a comprehensive and centralized resource was essential to meet the increasing demand for a definitive Iceberg reference. This realization was the driving force behind the creation of Apache Iceberg: The Definitive Guide. Our goal is to provide readers with a single authoritative source that bridges the knowledge gap and empowers individuals and organizations to make the most of Apache Iceberg’s capabilities in their data-related endeavors.

What You Will Find Inside

In the following chapters, you will learn what Apache Iceberg is and how it works, how you can take advantage of the format with a variety of tools, and best practices to manage the quality and governance of the data in Apache Iceberg tables. Here is a summary of each chapter’s content:

Chapter 1, “Introduction to Apache Iceberg”
Exploration of the historical context of data lakehouses and the essential concepts underlying Apache Iceberg.

Chapter 2, “The Architecture of Apache Iceberg”
Deep dive into the intricate design of Apache Iceberg, examining how its various components function together.

Chapter 3, “Lifecycle of Write and Read Queries”
Examination of the step-by-step process involved in Apache Iceberg transactions, highlighting updates, reads, and time-travel queries.

Chapter 4, “Optimizing the Performance of Iceberg Tables”
Discussions on maintaining optimized performance in Apache Iceberg tables through techniques such as compaction and sorting.

Chapter 5, “Iceberg Catalogs”
In-depth explanation of the role of Apache Iceberg catalogs, exploring the different catalog options available.

Chapter 6, “Apache Spark”
Practical sessions using Apache Spark to manage and interact with Apache Iceberg tables.

Chapter 7, “Dremio’s SQL Query Engine”
Exploration of the Dremio lakehouse platform, focusing on DDL, DML, and table optimization for Apache Iceberg tables.

Chapter 8, “AWS Glue”
Demonstration of the use of AWS Glue Catalog and AWS Glue Studio for working with Apache Iceberg tables.

Chapter 9, “Apache Flink”
Practical exercises in using Apache Flink for streaming data processing with Apache Iceberg tables.

Chapter 10, “Apache Iceberg in Production”
Insights into managing data quality in production, using metadata tables for table health monitoring and employing table and catalog versioning for various operational needs.

Chapter 11, “Streaming with Apache Iceberg”
Use of tools such as Apache Spark, Flink, and AWS Glue for streaming data processing into Iceberg tables.

Chapter 12, “Governance and Security”
Exploration of the application of governance and security at various levels in Apache Iceberg tables, such as storage, semantic layers, and catalogs.

Chapter 13, “Migrating to Apache Iceberg”
Guidelines on transforming existing datasets from different file types and databases into Apache Iceberg tables.

Chapter 14, “Real-World Use Cases of Apache Iceberg”
A look at real-world applications of Apache Iceberg, including powering business intelligence dashboards and implementing change data capture.

Direct link to PDF

Dremio bills itself as a "Unified Analytics Platform for a Self-Service Lakehouse". The authors of the book work for Dremio and may have an ownership interest in the company.

jim 2 points 8 months ago

Nice! I'm a big fan of Iceberg, and it's nice to see books coming out for it. I used it quite a bit with Spark, and it's a pleasure to use.

I'm waiting for the Python support to be complete; once it is, I can see myself using Iceberg full time. Right now, I'm trying to use DuckDB and Python for nearly everything outside of the database, and the only thing missing is good Python support for Iceberg.