Data Engineering


A community for discussion about data engineering

Icon base by Delapouite under CC BY 3.0 with modifications to add a gradient


I am creating a couple of large database tables, each with at least hundreds of millions of observations and growing. Some tables are at minute resolution, some at millisecond resolution. Timestamps are not necessarily unique.

Should I create separate year, month, or date and time columns? Is one unique datetime column enough? At what size would you partition the tables?

Raw data is in csv.

Currently I am aiming for Postgres and DuckDB. Does TimescaleDB make a significant difference?
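For illustration, here is a minimal sketch of one common layout for data like this, assuming Postgres 13+ and psycopg2 (the database, table, and column names are invented): a single timestamptz column, with year and month derived at query time via date_trunc or EXTRACT, monthly range partitions, and a BRIN index per partition.

```python
# A sketch, not a recommendation for any exact workload: single timestamptz
# column, monthly range partitions, BRIN index per partition.
# Assumes Postgres 13+ and psycopg2; database, table, and column names are invented.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS readings (
    ts        timestamptz      NOT NULL,  -- one datetime column; duplicates are fine
    series_id integer          NOT NULL,
    value     double precision
) PARTITION BY RANGE (ts);

CREATE TABLE IF NOT EXISTS readings_2024_07
    PARTITION OF readings
    FOR VALUES FROM ('2024-07-01') TO ('2024-08-01');

-- BRIN indexes stay tiny on append-mostly, time-ordered data.
CREATE INDEX IF NOT EXISTS readings_2024_07_ts_brin
    ON readings_2024_07 USING brin (ts);
"""

with psycopg2.connect("dbname=metrics") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```

CSV files can then be bulk-loaded into the parent table with COPY, and Postgres routes each row to the matching partition; year and month rarely need their own columns, since date_trunc and EXTRACT can derive them at query time.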

Fun with Hy and Pandas (benrutter.github.io)
submitted 2 months ago by [email protected] to c/data_engineering
 
 

cross-posted from: https://slrpnk.net/post/13881784

Hy (a Lisp built on top of Python, similar to how Clojure is built on top of the JVM) released v1 recently. I couldn't resist playing with it and found it worked sooo nicely. Thanks to all the maintainers for creating a great language!

Shift Left (medium.com)
submitted 4 months ago by [email protected] to c/data_engineering
 
 

Hey there Data Engineers. Want to stop putting out fires and start preventing them? Then it might be time to "shift left." By tackling quality, governance, and security from the get-go, you'll save time, money, and headaches.

If you want to learn more, follow the paywall-bypassed link to my latest article. I hope some of you find it useful!


Book Preface:

Welcome to Apache Iceberg: The Definitive Guide! We’re delighted you have embarked on this learning journey with us. In this preface, we provide an overview of this book, why we wrote it, and how you can make the most of it.

About This Book

In these pages, you’ll learn what Apache Iceberg is, why it exists, how it works, and how to harness its power. Designed for data engineers, architects, scientists, and analysts working with large datasets across various use cases from BI dashboards to AI/ML, this book explores the core concepts, inner workings, and practical applications of Apache Iceberg. By the time you reach the end, you will have grasped the essentials and possess the practical knowledge to implement Apache Iceberg effectively in your data projects. Whether you are a newcomer or an experienced practitioner, Apache Iceberg: The Definitive Guide will be your trusted companion on this enlightening journey into Apache Iceberg.

Why We Wrote This Book

As we observed the rapid growth and adoption of the Apache Iceberg ecosystem, it became evident that a growing knowledge gap needed to be addressed. Initially, we began by sharing insights through a series of blog posts on the Dremio platform to provide valuable information to the burgeoning Iceberg community. However, it soon became clear that a comprehensive and centralized resource was essential to meet the increasing demand for a definitive Iceberg reference. This realization was the driving force behind the creation of Apache Iceberg: The Definitive Guide. Our goal is to provide readers with a single authoritative source that bridges the knowledge gap and empowers individuals and organizations to make the most of Apache Iceberg’s capabilities in their data-related endeavors.

What You Will Find Inside

In the following chapters, you will learn what Apache Iceberg is and how it works, how you can take advantage of the format with a variety of tools, and best practices to manage the quality and governance of the data in Apache Iceberg tables. Here is a summary of each chapter’s content:

  • Chapter 1, “Introduction to Apache Iceberg”
    Exploration of the historical context of data lakehouses and the essential concepts underlying Apache Iceberg.
  • Chapter 2, “The Architecture of Apache Iceberg”
    Deep dive into the intricate design of Apache Iceberg, examining how its various components function together.
  • Chapter 3, “Lifecycle of Write and Read Queries”
    Examination of the step-by-step process involved in Apache Iceberg transactions, highlighting updates, reads, and time-travel queries.
  • Chapter 4, “Optimizing the Performance of Iceberg Tables”
    Discussions on maintaining optimized performance in Apache Iceberg tables through techniques such as compaction and sorting.
  • Chapter 5, “Iceberg Catalogs”
    In-depth explanation of the role of Apache Iceberg catalogs, exploring the different catalog options available.
  • Chapter 6, “Apache Spark”
    Practical sessions using Apache Spark to manage and interact with Apache Iceberg tables.
  • Chapter 7, “Dremio’s SQL Query Engine”
    Exploration of the Dremio lakehouse platform, focusing on DDL, DML, and table optimization for Apache Iceberg tables.
  • Chapter 8, “AWS Glue”
    Demonstration of the use of AWS Glue Catalog and AWS Glue Studio for working with Apache Iceberg tables.
  • Chapter 9, “Apache Flink”
    Practical exercises in using Apache Flink for streaming data processing with Apache Iceberg tables.
  • Chapter 10, “Apache Iceberg in Production”
    Insights into managing data quality in production, using metadata tables for table health monitoring and employing table and catalog versioning for various operational needs.
  • Chapter 11, “Streaming with Apache Iceberg”
    Use of tools such as Apache Spark, Flink, and AWS Glue for streaming data processing into Iceberg tables.
  • Chapter 12, “Governance and Security”
    Exploration of the application of governance and security at various levels in Apache Iceberg tables, such as storage, semantic layers, and catalogs.
  • Chapter 13, “Migrating to Apache Iceberg”
    Guidelines on transforming existing datasets from different file types and databases into Apache Iceberg tables.
  • Chapter 14, “Real-World Use Cases of Apache Iceberg”
    A look at real-world applications of Apache Iceberg, including business intelligence dashboards and implementing change data capture.

Direct link to PDF

Dremio bills itself as a "Unified Analytics Platform for a Self-Service Lakehouse". The authors of the book work for Dremio and may have ownership interest in Dremio.


July 18, 2024 Narek Galstyan writes:

We were naturally curious when we saw Pinecone's blog post comparing Postgres and Pinecone.

In their post on Postgres, Pinecone recognizes that Postgres is easy to start with as a vector database, since most developers are familiar with it. However, they argue that Postgres falls short in terms of quality. They describe issues with index size predictability, index creation resource intensity, metadata filtering performance, and cost.

This is a response to Pinecone's blog post, where we show that Postgres outperforms Pinecone in the same benchmarks with a few additional tweaks. We show that with just 20 lines of additional code, Postgres with the pgvector or lantern extension outperforms Pinecone by reaching 90% recall (compared to Pinecone's 60%) with under 200ms p95 latency.

Read Postgres vs. Pinecone
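As a rough illustration of the kind of tweak involved (not the exact code from the article), here is a sketch of building an HNSW index with pgvector and raising ef_search at query time; the table name, vector dimension, and parameter values are invented.

```python
# Illustrative sketch only, not the article's benchmark code.
# Assumes Postgres with the pgvector extension (0.5+ for HNSW) and psycopg2;
# the table name, vector dimension, and parameter values are invented.
import psycopg2

conn = psycopg2.connect("dbname=vectors")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(768));")

# Build an HNSW index; m and ef_construction trade build time for recall.
cur.execute("""
    CREATE INDEX IF NOT EXISTS items_embedding_hnsw
    ON items USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
conn.commit()

# At query time, hnsw.ef_search is the main recall-versus-latency knob.
cur.execute("SET hnsw.ef_search = 100;")
query_vec = "[" + ",".join(["0.0"] * 768) + "]"  # stand-in for a real embedding
cur.execute(
    "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10;",
    (query_vec,),
)
print(cur.fetchall())
```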


7/3/2024

Steven Wang writes:

Many in the data space are now aware of Iceberg and its powerful features that bring database-like functionality to files stored in the likes of S3 or GCS. But Iceberg is just one piece of the puzzle when it comes to transforming files in a data lake into a Lakehouse capable of analytical and ML workloads. Along with Iceberg, which is primarily a table format, a query engine is also required to run queries against the tables and schemas managed by Iceberg. In this post we explore some of the query engines available to those looking to build a data stack around Iceberg: Snowflake, Spark, Trino, and DuckDB.

...

DuckDB + Iceberg Example

We will be loading 12 months of NYC yellow cab trip data (April 2023 - April 2024) into Iceberg tables and demonstrating how DuckDB can query these tables.

Read Comparing Iceberg Query Engines
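For a flavor of the DuckDB side, here is a minimal sketch of scanning an Iceberg table with DuckDB's iceberg extension (not the article's code; the table path and column names are invented).

```python
# Minimal sketch, not the article's code: scanning an Iceberg table with DuckDB.
# Assumes the DuckDB iceberg extension; the table path and columns are invented.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# iceberg_scan() takes the table's root folder (or a metadata file); S3/GCS
# paths also work once httpfs and credentials are configured.
print(con.sql("""
    SELECT passenger_count, count(*) AS trips
    FROM iceberg_scan('warehouse/nyc_taxi/yellow_tripdata')
    GROUP BY passenger_count
    ORDER BY trips DESC
"""))
```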


Let me share my post with a detailed step-by-step guide on how an existing Spark Scala library can be adapted to work with the recently introduced Spark Connect. As an example I have chosen a popular open source data quality tool, AWS Deequ. I made all the necessary protobuf messages and a Spark Connect plugin. I tested it from PySpark Connect 3.5.1 and it works. Of course, all the code is public in git.
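For anyone who hasn't tried Spark Connect yet, the client side is just a remote SparkSession. Here is a minimal sketch of a plain PySpark Connect session (not the Deequ plugin itself); it assumes PySpark 3.4+ with the connect extras, and the server URL is invented.

```python
# Minimal sketch of a PySpark Connect client session (not the Deequ plugin code).
# Assumes PySpark 3.4+ with the connect extras installed and a Spark Connect
# server already running at the (invented) URL below.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost:15002")  # 15002 is the default Spark Connect port
    .getOrCreate()
)

df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
df.groupBy("bucket").count().show()
```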


Time and again I see the same questions asked: "Why should I use dbt?" or "I don't understand what value dbt offers". So I thought I'd put together an article that touches on some of the benefits, as well as a step-through of setting up a new project (using DuckDB as the database), complete with an associated GitHub repo for you to take a look at.

Having used dbt since early 2018, and with my partner being a dbt trainer, I hope that this article is useful for some of you. The link is paywall bypassed.
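If you'd rather kick the tyres from Python than from the command line, dbt can also be invoked programmatically. A minimal sketch, assuming dbt-core 1.5+ with the dbt-duckdb adapter installed and an invented project directory:

```python
# Minimal sketch: invoking dbt from Python instead of the CLI.
# Assumes dbt-core 1.5+ and the dbt-duckdb adapter are installed, and that
# "my_dbt_project" (an invented name) contains a valid dbt_project.yml and profile.
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()
res: dbtRunnerResult = dbt.invoke(["run", "--project-dir", "my_dbt_project"])

# One result per model, each with a status (success, error, skipped, ...).
for r in res.result:
    print(f"{r.node.name}: {r.status}")
```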


If you're a Data Engineer, before long you'll be asked to build a real-time pipeline.

In my latest article, I build a real-time pipeline using Kafka, Polars and Delta tables to demonstrate how these can work together. Everything is available to try yourself in the associated GitHub repo. So if you're curious, take a moment to check out this technical post.
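As a taster of the overall shape (a simplified sketch rather than the article's code), here is the consume-then-append pattern with kafka-python, Polars, and the deltalake package; the topic name, brokers, message schema, and table path are invented.

```python
# Simplified sketch of the pattern, not the article's code: consume a micro-batch
# from Kafka, build a Polars DataFrame, append it to a Delta table.
# Assumes kafka-python, polars, and deltalake are installed; the topic, brokers,
# message schema, and table path are invented.
import json

import polars as pl
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating once the topic goes quiet
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

batch = [msg.value for msg in consumer]  # e.g. {"event_id": 1, "ts": "...", "amount": 9.5}
if batch:
    df = pl.DataFrame(batch)
    df.write_delta("data/delta/events", mode="append")  # one atomic Delta transaction
    print(f"wrote {df.height} rows")
```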


How often do you build and edit Entity Relationship Diagrams? If the answer is ‘more often than I’d like’, and you’re fed up with tweaking your diagrams, take <5 minutes to read my latest article on building your diagrams with code. Track their changes in GitHub, have them build as part of your CI/CD pipeline, and even drop them into your dbt docs if you like.

This is a ‘friends and family’ link, so it’ll bypass the usual Medium paywall.

I’m not affiliated with the tool I’ve chosen in any way. I just like how it works.

Let me know your thoughts!


Mar 8, 2024 | Hakampreet Singh Pandher writes:

Yelp relies heavily on streaming to synchronize enormous volumes of data in real time. This is facilitated by Yelp’s underlying data pipeline infrastructure, which manages the real-time flow of millions of messages originating from a plethora of services. This blog post covers how we leverage Yelp’s extensive streaming infrastructure to build robust data abstractions for our offline and streaming data consumers. We will use Yelp’s Business Properties ecosystem (explained in the upcoming sections) as an example.

Read Building data abstractions with streaming at Yelp


I’ve written a series of Medium articles on creating a data pipeline from scratch, using Polars and Delta tables. The first (linked) is an overview with links to the GitHub repository and to each of the deeper-dive articles. I then go into the next level of detail, walking through each component.

The articles are paywalled (it took time to build and document), but the link provided is the ‘family & friends’ link which bypasses the paywall for the Lemmy community.

I hope some of you may find this helpful.


Hello,

I am looking for some advice to help me out at my job. Apologies if this is the wrong place to ask.

So, basically, my boss is a complete technophobe and all of our data is stored across multiple Excel files in Dropbox, and I'm looking for a way to move that into a centralized database. I know my way around a computer, but writing code is not something I have ever been able to grasp well.

The main issue with our situation is that our workers are all completely remote, and no, I don't mean working from a home office in the suburbs. They use little laptops with no data connection and go out gathering data every day from a variety of locations, sometimes without even cell coverage.

We need up to 20 people entering data all day long and then updating a centralized database at the end of the day when they get back home and have internet connection. It will generally all be new entries, no one will need to be updating old entries.

It would be nice to have some sort of data entry form in Dropbox and a centralized database on our local server at head office which pulls the data in at the end of each day. Field workers would also need access to certain data such as addresses, contact info, maps, photos, historical data, etc., but not all of it. For example, the worker in City A only needs access to the historical data from records in and around City A, and workers in City B only need access to records involving City B.

Are there any recommended options for software that can achieve this? It needs to be relatively user-friendly and simple, as our workers are typically biology-oriented summer students, not programmers.


Apple has donated to the community its own implementation of native physical execution of Apache Spark plans, built on DataFusion.


A few years ago, if you'd mentioned Infrastructure-as-Code (IaC) to me, I would've given you a puzzled look. However, I'm now on the bandwagon. And to help others understand how it can benefit them, I've pulled together a simple GitHub repo that showcases how Terraform can be used with Snowflake to manage users, roles, warehouses, and databases.

The readme hopefully gives anyone who wants to give it a go the ability to step through and see results. I'm sharing this in the hopes that it is useful to some of you.


December 28 2023 Pankaj Singh writes:

In big data processing and analytics, choosing the right tool is paramount for efficiently extracting meaningful insights from vast datasets. Two popular frameworks that have gained significant traction in the industry are Apache Spark and Presto. Both are designed to handle large-scale data processing efficiently, yet they have distinct features and use cases. As organizations grapple with the complexities of handling massive volumes of data, a comprehensive understanding of Spark and Presto’s nuances and distinctive features becomes essential. In this article, we will compare Spark vs Presto, exploring their performance and scalability, data processing capabilities, ecosystem, integration, and use cases and applications.

Read Spark vs Presto: A Comprehensive Comparison


Hi all,

For those wanting a quick repo to use as a basis to get started, I’ve created jen-ai.

There are full instructions in the readme. Once running you can talk to it, and it will respond.

It’s basic, but a place to start.

I'd like to Volunteer to Moderate (self.data_engineering)
submitted 11 months ago* (last edited 11 months ago) by ericjmorey to c/data_engineering
 
 

Since there's only one mod here and they are also an admin, I'd like to volunteer to moderate this community.


Karl W. Broman & Kara H. Woo write:

Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this article offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, do not include calculations in the raw data files, do not use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.

Read Data Organization in Spreadsheets

This article is weird in that it appears to be written for an audience that would find its contents irrelevant, but it has great information for people who are trying to reduce or eliminate their use of spreadsheets.
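For data engineers on the receiving end of spreadsheet exports, several of those principles (ISO dates, no empty cells, one rectangle per sheet) are easy to check mechanically. A small sketch with pandas, using an invented file and column names:

```python
# Small sketch: mechanically checking a spreadsheet export against a few of the
# paper's principles (YYYY-MM-DD dates, no empty cells). File and column names
# are invented.
import pandas as pd

df = pd.read_csv("field_samples.csv")
problems = []

# Principle: do not leave any cells empty.
empty = df.isna().sum()
for col, n in empty[empty > 0].items():
    problems.append(f"column {col!r} has {n} empty cells")

# Principle: write dates like YYYY-MM-DD (a strict parse; bad values raise).
try:
    pd.to_datetime(df["sample_date"], format="%Y-%m-%d")
except (KeyError, ValueError) as exc:
    problems.append(f"sample_date is not clean ISO dates: {exc}")

print("\n".join(problems) if problems else "looks consistent")
```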


cross-posted from: https://programming.dev/post/8246313

Data science managers and leaders should make sure that cooperative work on models is facilitated and streamlined. In this post, our very own Shachaf Poran, PhD suggests one method of doing so.


Posted on 2023-12-15 by Tony Solomonik

About a year ago, I tried to work out which database I should choose for my next project, and came to the realization that I don't really understand the differences between databases well enough. I went to different database websites and saw mostly marketing and words I didn't understand.

This is when I decided to read the excellent books Database Internals by Alex Petrov and Designing Data-Intensive Applications by Martin Kleppmann.

The books piqued my curiosity enough to write my own little database I called dbeel.

This post is basically a short summary of these books, with a focus on the fundamental problems a database engineer thinks about in the shower.

Read Database Fundamentals


Analytics Data Storage

Over the past 40 years, businesses have had a common problem: creating analytics directly from raw application data doesn't work well. The format isn't ideal for analytics tools, analytics workloads can cause performance spikes on critical applications, and the data for a single report could come from many different sources. AI has worsened the issue, since it needs yet another style of data formatting and access. The primary solution has been to copy the data into a separate storage system better suited to analytics and AI needs.

Data Warehouses

Data warehouses are large, centralized repositories for storing, managing, and analyzing vast amounts of structured data from various sources. They are designed to support efficient querying and reporting, providing businesses with crucial insights for informed decision-making. Data warehouses utilize a schema-based design, often employing star or snowflake schemas, which organize data in tables related by keys. They are usually built using SQL engines. Small data warehouses can easily be created on the same software as applications, but specialized SQL engines like Amazon Redshift, Google BigQuery, and Snowflake are designed to handle analytics data on a much larger scale.

The main goal of a data warehouse is to support Online Analytical Processing (OLAP), allowing users to perform multidimensional data analysis, exploring the data from different perspectives and dimensions. The primary issue with this format is that it isn't well suited to machine learning applications, because the data access options don't work well with machine learning tools at scale.

Star Schemas

Star and snowflake schemas are similar: a snowflake schema is essentially a star schema whose dimension tables are further normalized into sub-dimensions, allowing additional complexity. The names come from the fact that diagrams of the various tables and their connections look like stars and snowflakes.

Data is divided into two types of tables: fact tables and dimensions. Dimension tables contain all the data common across events. For a specific sale at a retailer, there might be tables for customer information, the sale date, the order status, and the products sold. Those would be linked to a single fact table containing data specific to that sale, like the purchase and retail prices.
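To make the shape concrete, here is a toy star schema and an OLAP-style rollup over it, sketched in DuckDB; the tables, columns, and values are invented.

```python
# Toy star schema sketched in DuckDB: one fact table keyed to two dimension
# tables, plus an OLAP-style rollup. All tables, columns, and values are invented.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region VARCHAR)")
con.execute("CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, sale_year INTEGER, sale_month INTEGER)")
con.execute("""
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_id  INTEGER REFERENCES dim_customer(customer_id),
        date_id      INTEGER REFERENCES dim_date(date_id),
        retail_price DECIMAL(10, 2),  -- the measures live in the fact table
        cost_price   DECIMAL(10, 2)
    )
""")
con.execute("INSERT INTO dim_customer VALUES (1, 'EU'), (2, 'US')")
con.execute("INSERT INTO dim_date VALUES (20240601, 2024, 6), (20240702, 2024, 7)")
con.execute("""
    INSERT INTO fact_sales VALUES
        (1, 1, 20240601, 19.99, 12.00),
        (2, 2, 20240702, 5.50, 2.10)
""")

# Multidimensional analysis: revenue by region/year/month, with subtotals.
print(con.sql("""
    SELECT c.region, d.sale_year, d.sale_month, sum(f.retail_price) AS revenue
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_date d USING (date_id)
    GROUP BY ROLLUP (c.region, d.sale_year, d.sale_month)
    ORDER BY c.region, d.sale_year, d.sale_month
"""))
```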

Data Lakes

Data warehouses had three main failings that drove the creation of data lakes.

  1. The data isn't easily accessible to ML tools.
  2. Data warehouses don't handle unstructured data like images and audio very well.
  3. Data storage costs are much higher than in the systems that would become data lakes.

Data lakes are usually built on top of an implementation of HDFS, the Hadoop Distributed File System. Previous file storage options had limits on how much data they could store, and large companies were starting to exceed those limits. HDFS effectively removes those limits: the technology can handle data storage at a significantly larger scale than any current data lake requires, and it can be expanded further in the future.

Data warehouses are considered "schema on write": the data is organized into tables before it is written to storage. Data lakes are primarily "schema on read": the data may be only partially structured when it is loaded, and transformations are applied when a system reads it.
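As a small illustration of the schema-on-read side, the raw file can be left untyped in the lake and given a schema only when a query reads it. A sketch with DuckDB (the file path and column names are invented):

```python
# Sketch of "schema on read": the raw CSV sits untyped in the lake, and a schema
# is applied only when a query reads it. File path and column names are invented.
import duckdb

con = duckdb.connect()
print(con.sql("""
    SELECT ts, user_id, event_type
    FROM read_csv(
        'lake/raw/events.csv',
        header = true,
        columns = {'ts': 'TIMESTAMP', 'user_id': 'BIGINT', 'event_type': 'VARCHAR'}
    )
    WHERE event_type = 'purchase'
"""))
```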

Data lakes are often structured in layers. The first layer is the data in its original, raw format. The following layers will have increasing levels of structure. For example, an image might be in its raw form in the first layer, but later layers may have information about the image instead. The new file might have the image's metadata, a link to the original image, and an AI-generated description of what it contains.

Data lakes solved the scale problem and are more useful for machine learning, but they don't have the same quality controls over the data that warehouses do, and "schema on read" creates significant performance problems. For the past decade, companies have maintained a complex ecosystem of data lakes, warehouses, and real-time streaming systems like Kafka. This supports analytics and ML, but creating, maintaining, and processing data across so many systems is exceptionally time-consuming.

Data Lakehouse

The data lakehouse is an emerging style of data storage that attempts to combine the benefits of data lakes and data warehouses. The foundation of a lakehouse is a data lake, so it can support data at the same scale that data lakes currently handle.

The first innovation in this direction is technologies like Delta Lake, which pairs Parquet data files with a transaction log (and integrates closely with Apache Spark for processing) to create data lake layers that support transactions and data quality controls while maintaining a compact data format. This approach is well suited to ML use cases, since it addresses the data quality problems of data lakes and the data access problems of data warehouses. Current work focuses on letting a lakehouse serve analytics more effectively with tools like caching and indexes.
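To make the transactions-and-quality-controls point concrete, here is a small sketch using the standalone deltalake Python package rather than Spark (paths and schema are invented): each append lands as an atomic new table version, an append with a mismatched schema is rejected, and older versions stay queryable.

```python
# Sketch of the lakehouse idea with the standalone deltalake package (no Spark):
# Parquet data files plus a transaction log give atomic appends, schema
# enforcement, and time travel. Paths and schema are invented.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "lakehouse/sales"

write_deltalake(path, pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 3.2]}))
write_deltalake(path, pd.DataFrame({"order_id": [3], "amount": [7.0]}), mode="append")

# Schema enforcement: an append whose schema doesn't match is rejected.
try:
    write_deltalake(path, pd.DataFrame({"order_id": ["oops"]}), mode="append")
except Exception as exc:
    print("rejected append:", type(exc).__name__)

dt = DeltaTable(path)
print("current version:", dt.version())        # each commit is a new version
print(DeltaTable(path, version=0).to_pandas()) # time travel back to the first commit
```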
