Technical Knowledge Base

This section gives an overview of technologies and tools commonly used in data engineering, grouped by category.

Relational Databases

  • PostgreSQL: An open-source object-relational database system known for its reliability and robust features.
  • MySQL: A widely used open-source relational database management system, now developed and maintained by Oracle Corporation.
  • Amazon Relational Database Service (RDS): A managed service for setting up, operating, and scaling relational databases in the cloud, supporting multiple database engines.

Columnar Databases

  • Amazon Redshift: A cloud data warehouse known for its fast performance, available in both provisioned and serverless configurations.
  • Google BigQuery: A serverless enterprise data warehouse that offers high scalability and cost-effectiveness.
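
To illustrate why a columnar layout suits analytical queries, here is a minimal sketch in plain Python. It is a toy model only: the sample records and the `to_columns` helper are hypothetical, and real warehouses such as Redshift or BigQuery add compression, encoding, and distributed execution on top of this basic idea.

```python
# Toy illustration of row-oriented vs. column-oriented layout.
# The data and helper names are made up for this example.

# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column reads only that column's values,
# whereas the row layout has to visit every full record.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(record["amount"] for record in rows)
```

Both totals are the same; the difference is how much data has to be scanned to compute them, which is what makes column stores attractive for aggregations over a few fields of wide tables.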

Key-Value Stores

  • Redis: An open-source, in-memory key-value store known for its versatility and performance.
  • Amazon DynamoDB: A fully managed, serverless NoSQL database designed for modern applications.
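
The key-value model itself is simple: values are written and read by key, often with an optional expiry. The sketch below is a hypothetical in-memory class, `TinyKeyValueStore`, written only to show that access pattern. It is not the Redis or DynamoDB API, and the injectable `clock` parameter is an assumption added to make expiry easy to demonstrate.

```python
import time


class TinyKeyValueStore:
    """Hypothetical in-memory key-value store with optional expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be simulated

    def set(self, key, value, ttl_seconds=None):
        """Store a value, optionally expiring after ttl_seconds."""
        expires_at = None
        if ttl_seconds is not None:
            expires_at = self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        """Return the value for key, or default if missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
store = TinyKeyValueStore(clock=lambda: now[0])
store.set("session:42", "alice", ttl_seconds=30)
store.set("config:theme", "dark")

before_expiry = store.get("session:42")
now[0] = 31.0  # advance the fake clock past the TTL
after_expiry = store.get("session:42")
```

Production stores layer persistence, replication, and richer data types on top of this get/set core.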

Object Storage

  • Amazon S3: Offers scalable and secure object storage services.
  • Azure Blob Storage: Microsoft's object storage service, suited to large-scale cloud-native workloads, archives, and data lakes.
  • Google Cloud Storage: A flexible service for storing and retrieving any amount of data.

Data Ingestion

Data Formats

  • Apache Avro: A serialization format ideal for streaming data pipelines.
  • Apache Parquet: An open-source columnar data file format.
  • Apache ORC: A columnar file format optimized for Hadoop workloads.

Data Storage Framework

  • Delta Lake: An open-source storage framework, originally developed by Databricks, for building lakehouse architectures.
  • Apache Iceberg: An open table format for huge analytical datasets, originally developed at Netflix.
  • Apache Hudi: A framework for managing large analytical datasets, originally created at Uber.

Batch Processing

Frameworks and Libraries

  • Apache Spark: A general-purpose engine for large-scale data processing, with APIs in Python (PySpark), Scala, Java, and R.

SQL Engines

  • Presto: A distributed SQL query engine for big data.
  • Apache Hive: Built on Hadoop, it facilitates reading, writing, and managing large datasets.
  • Apache Drill: An open-source, schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
  • Trino: A distributed SQL query engine for large datasets, forked from Presto (formerly PrestoSQL).

Managed Services (Cloud)

Stream Processing

Data Stores

Workflow Orchestration

  • Apache Airflow: An open-source platform for managing complex computational workflows and data processing pipelines.
  • Mage: An open-source tool for transforming and integrating data, positioned as a modern alternative to Airflow.
  • Dagster: An orchestration platform for data assets.
  • Prefect: A workflow orchestration tool for data pipelines.
  • Kestra: An orchestrator for both scheduled and event-driven workflows.
  • AWS Step Functions: Coordinates components of distributed applications.

Data Transformation

Frameworks

  • dbt (data build tool): A SQL-based transformation framework that applies software engineering practices such as version control, testing, and modular code to analytics work.
  • SQLMesh: An open-source data transformation framework for SQL and Python.

Data Governance

Enterprise Data Catalog

  • DataHub: An extensible open-source metadata platform, originally developed at LinkedIn.
  • OpenMetadata: A platform for discovering, collaborating, and managing data.
  • Apache Atlas: An open-source metadata and governance framework.
  • Amundsen: A data discovery and metadata engine, originally developed at Lyft.

Data Quality/Observability

Data Platforms

  • Databricks: Offers a unified Data Lakehouse platform.
  • Snowflake: A cloud-native data warehouse platform.