Technical Knowledge Base
“This section provides a comprehensive overview of various technologies and tools in the field of data engineering.”
Relational Databases
- PostgreSQL: An open-source object-relational database system known for its reliability and robust features.
- MySQL: A widely used open-source SQL database management system, developed by Oracle Corporation.
- Amazon Relational Database Service (RDS): A managed service for setting up and scaling databases in the cloud, supporting multiple database engines.
Columnar Databases
- Amazon Redshift: A cloud data warehouse known for its fast performance, available in both provisioned and serverless configurations.
- Google BigQuery: A serverless enterprise data warehouse that offers high scalability and cost-effectiveness.
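To make the term "columnar" concrete, here is a minimal sketch in plain Python. It is a toy model only and does not describe how Redshift or BigQuery store data internally; the sample records and the helper `to_columns` are hypothetical.

```python
# Toy illustration of row-oriented vs column-oriented layout.
# The records and the to_columns helper are made up for this example.

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the aggregate has to visit every full record.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the aggregate reads only the one column it needs.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Analytical queries usually touch a few columns across many rows, which is why a column-per-list layout suits them.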
Key-value Stores
- Redis: An open-source, in-memory key-value store known for its versatility and performance.
- Amazon DynamoDB: A fully managed, serverless NoSQL database designed for modern applications.
Object Storage
- Amazon S3: Offers scalable and secure object storage services.
- Azure Blob Storage: Ideal for storing large-scale cloud-native workloads, archives, and data lakes.
- Google Cloud Storage: A flexible service for storing and retrieving any amount of data.
Data Ingestion
- Apache Kafka: A distributed event streaming platform with various implementations:
  - Confluent’s Apache Kafka: A fully managed service with expert support.
  - Amazon MSK: Amazon’s fully managed Kafka service.
- AWS SDK for pandas (formerly AWS Data Wrangler): Extends the pandas library to AWS, integrating DataFrames with AWS data services such as S3, Athena, and Redshift.
- Amazon Kinesis: A cloud-based service for collecting and processing streaming data in real time.
- Airbyte: An open-source data integration platform for ELT pipelines.
- Pentaho Data Integration (Kettle): Includes both a core data integration engine and a graphical user interface for defining jobs and transformations.
Data Formats
- Apache Avro: A row-based serialization format with schema support, commonly used in streaming data pipelines.
- Apache Parquet: An open-source columnar data file format.
- Apache ORC: A columnar file format optimized for Hadoop workloads.
Data Storage Framework
- Delta Lake: An open-source storage framework, originally developed by Databricks, for building lakehouse architectures.
- Apache Iceberg: An open table format for huge analytical datasets, originally developed at Netflix.
- Apache Hudi: A framework for managing large analytical datasets, originally created at Uber.
Batch Processing
Frameworks and Libraries
- Apache Spark: A versatile engine for big data processing, available in various languages like Python (PySpark), Scala, Java, and R.
SQL Engines
- Presto: A distributed SQL query engine for big data.
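- Apache Hive: A data warehouse system built on Hadoop for reading, writing, and managing large datasets using SQL.
- Apache Drill: An open-source, schema-free SQL query engine for files and NoSQL stores.
- Trino: A distributed SQL query engine for large data sets, forked from Presto (formerly PrestoSQL).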
Managed Services (Cloud)
- Amazon EMR (formerly Amazon Elastic MapReduce): A cloud big data platform for processing large datasets.
- AWS Glue: A serverless data integration service.
Stream Processing
- Spark Streaming: The legacy DStream-based component of Apache Spark for processing live data streams.
- Spark Structured Streaming: A stream processing engine built on Spark SQL.
- Apache Flink: A framework for stateful computations over data streams.
- Apache Storm: A distributed system for processing streaming data in real time.
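A common operation in the engines listed above is windowed aggregation. The sketch below shows the idea of a tumbling (fixed, non-overlapping) window in plain Python. It is an illustration under simplifying assumptions: the events are a finite, in-memory list, and the function `tumbling_window_counts` is hypothetical, not an API of Spark, Flink, or Storm. Real engines also handle state, late data, and fault tolerance.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping (window_start, key) to a count.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical event stream: (timestamp in seconds, event type).
events = [
    (0, "click"), (3, "click"), (9, "view"),
    (10, "click"), (14, "view"), (21, "click"),
]

result = tumbling_window_counts(events, window_seconds=10)
for (window_start, key), count in sorted(result.items()):
    print(window_start, key, count)
```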
Data Stores
- Apache Druid: A high-performance real-time analytics database.
- Apache Pinot: A real-time distributed OLAP datastore.
Workflow Orchestration
- Apache Airflow: An open-source platform for managing complex computational workflows and data processing pipelines.
- Mage: An open-source pipeline tool for transforming and integrating data, positioned as an alternative to Airflow.
- Dagster: An orchestration platform for data assets.
- Prefect: A workflow orchestration tool for data pipelines.
- Kestra: An orchestrator for both scheduled and event-driven workflows.
- AWS Step Functions: Coordinates components of distributed applications.
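What these orchestrators have in common is running tasks in dependency order. A minimal sketch of that idea, using only the Python standard library (`graphlib`, Python 3.9+), follows. The task names and the `run_pipeline` helper are hypothetical and do not reflect the API of any tool listed above, which also add scheduling, retries, and monitoring.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"transform", "quality_check"},
}


def run_pipeline(deps, tasks):
    """Call each task after all of its dependencies; return the run order."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        order.append(name)
    return order


# Each stand-in task just records that it ran.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(dependencies, tasks)
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependencies contain a cycle, which mirrors the requirement that a workflow be a directed acyclic graph.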
Data Transformation
Frameworks
- dbt (data build tool): A transformation workflow that applies software engineering practices such as version control and testing to SQL transformations.
- SQLMesh: An open-source data transformation framework for SQL and Python.
Data Governance
Enterprise Data Catalog
- DataHub: An extensible metadata platform, originally developed at LinkedIn.
- OpenMetadata: A platform for discovering, collaborating, and managing data.
- Apache Atlas: An open-source metadata and governance framework.
- Amundsen: A data discovery and metadata engine, originally developed at Lyft.
Data Quality/Observability
- Great Expectations: An open-source framework for data validation and quality testing.
Data Platforms
- Databricks: Offers a unified Data Lakehouse platform.
- Snowflake: A cloud-native data warehouse platform.