Technical Knowledge Base

This section provides an overview of technologies and tools commonly used in data engineering, grouped by category.

Relational Databases

PostgreSQL: An open-source object-relational database system known for its reliability and robust features.
MySQL: A widely used open-source relational database management system, currently developed and supported by Oracle Corporation.
Amazon Relational Database Service (RDS): A managed service for setting up, operating, and scaling relational databases in the cloud, supporting multiple database engines.

Data Warehouses

Amazon Redshift: A cloud data warehouse known for its fast performance, available in both provisioned and serverless configurations.
Google BigQuery: A serverless enterprise data warehouse that offers high scalability and cost-effectiveness.

NoSQL Databases

Redis: An open-source, in-memory key-value store known for its versatility and performance.
Amazon DynamoDB: A fully managed, serverless NoSQL database designed for modern applications.
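The key-value model behind stores like these can be illustrated with a minimal sketch. This is not the Redis or DynamoDB API. The class `TinyKeyValueStore` and its `set`/`get` methods are hypothetical names invented for illustration, and the per-key expiry only loosely mirrors the expiring keys that Redis supports.

```python
import time


class TinyKeyValueStore:
    """Hypothetical in-memory key-value store with optional per-key expiry."""

    def __init__(self):
        # Maps key -> (value, absolute expiry time or None)
        self._data = {}

    def set(self, key, value, ttl_seconds=None):
        """Store a value, optionally expiring it after ttl_seconds."""
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key):
        """Return the value for key, or None if it is missing or expired."""
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            # Lazily evict expired keys on read
            del self._data[key]
            return None
        return value


store = TinyKeyValueStore()
store.set("session:42", {"user": "ada"})          # no expiry
store.set("otp:42", "913577", ttl_seconds=0.05)   # expires after 50 ms
```

Lookups are by exact key only, which is the trade-off of the model: reads and writes stay simple and fast, but there are no joins or ad hoc queries as in a relational database.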

Object Storage

Amazon S3: A scalable and secure object storage service.
Azure Blob Storage: Microsoft's object storage service, suited to large-scale cloud-native workloads, archives, and data lakes.
Google Cloud Storage: A flexible service for storing and retrieving any amount of data.

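Data Ingestion and Integration

Apache Kafka: A distributed event streaming platform, also available as a managed service from several vendors:
- Confluent Cloud: Confluent's fully managed Kafka service, backed by vendor support.
- Amazon MSK (Amazon Managed Streaming for Apache Kafka): AWS's fully managed Kafka service.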
AWS SDK for pandas (formerly AWS Data Wrangler): Extends the pandas library to AWS, integrating DataFrames with AWS data services such as Amazon S3, Athena, and Redshift.
Amazon Kinesis: A cloud-based service for collecting and processing streaming data in real time.
Airbyte: An open-source data integration platform for ELT pipelines.
Pentaho Data Integration (Kettle): Includes both a core data integration engine and a graphical user interface for defining jobs and transformations.
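The event-streaming model used by platforms such as Kafka can be sketched as an append-only log from which each consumer group reads at its own offset. The sketch below is a simplified, single-partition illustration and not the Kafka or Kinesis client API. `TinyEventLog`, `append`, and `poll` are hypothetical names.

```python
from collections import defaultdict


class TinyEventLog:
    """Hypothetical single-partition, append-only log with per-group offsets."""

    def __init__(self):
        self._records = []                  # append-only list of events
        self._offsets = defaultdict(int)    # consumer group -> next offset to read

    def append(self, event):
        """Append an event and return its offset."""
        self._records.append(event)
        return len(self._records) - 1

    def poll(self, group, max_records=10):
        """Return unread events for a consumer group and advance its offset."""
        start = self._offsets[group]
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch


log = TinyEventLog()
for order_id in (1, 2, 3):
    log.append({"order_id": order_id})

billing_batch = log.poll("billing", max_records=2)   # reads offsets 0 and 1
analytics_batch = log.poll("analytics")              # independent group, reads all three
billing_rest = log.poll("billing")                   # resumes at offset 2
```

Because events are never removed when read, several independent consumers can process the same stream at different speeds.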

Open Table Formats

Delta Lake: An open-source storage framework, originally developed by Databricks, for building a lakehouse architecture.
Apache Iceberg: An open table format for huge analytical datasets, originally developed at Netflix.
Apache Hudi: A framework for managing large analytical datasets, originally created at Uber.

Data Processing Engines

Apache Spark: A general-purpose engine for large-scale data processing, with APIs in Python (PySpark), Scala, Java, and R.

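SQL Query Engines

Presto: A distributed SQL query engine for big data.
Apache Hive: A data warehouse system built on Hadoop that supports reading, writing, and managing large datasets using SQL.
Apache Drill: An open-source, schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.
Trino: A distributed SQL query engine for large datasets, forked from Presto and formerly known as PrestoSQL.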

Managed Data Processing Services

Amazon EMR (formerly Amazon Elastic MapReduce): A cloud big data platform for processing large datasets with frameworks such as Apache Spark and Hive.
AWS Glue: A serverless data integration service.

Workflow Orchestration

Apache Airflow: An open-source platform for managing complex computational workflows and data processing pipelines.
Mage: An open-source tool for transforming and integrating data, positioned by its developers as a modern alternative to Airflow.
Dagster: An orchestration platform for data assets.
Prefect: A workflow orchestration tool for data pipelines.
Kestra: An orchestrator for both scheduled and event-driven workflows.
AWS Step Functions: A serverless workflow service that coordinates the components of distributed applications.
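The idea these orchestrators share is running tasks in dependency order, commonly modeled as a directed acyclic graph (DAG). The sketch below illustrates it with Python's standard-library `graphlib`. It does not use the API of any tool listed above, and the task names and the `run_pipeline` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "publish_report": {"join_tables"},
}


def run_pipeline(deps, tasks):
    """Run each task once, only after all of its upstream tasks have finished."""
    executed = []
    for name in TopologicalSorter(deps).static_order():
        tasks[name]()
        executed.append(name)
    return executed


# Placeholder task bodies; a real task would move or transform data.
tasks = {name: (lambda: None) for name in dependencies}
run_order = run_pipeline(dependencies, tasks)
```

Production orchestrators add scheduling, retries, parallel execution, and monitoring on top of this ordering step.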

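Data Transformation

dbt (data build tool): A SQL-based transformation framework that applies software engineering practices such as version control, testing, and modular code to data transformations.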
SQLMesh: An open-source data transformation framework for SQL and Python.
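The workflow these tools support, where each transformation is a named SELECT statement built on top of earlier ones, can be sketched roughly as follows. This is not dbt or SQLMesh syntax. SQLite is swapped in as a stand-in database so that the example runs locally, and the table, model, and column names are hypothetical.

```python
import sqlite3

# Hypothetical models, listed in dependency order: name -> SELECT statement.
models = {
    "stg_orders": "SELECT id, customer, amount_cents / 100.0 AS amount FROM raw_orders",
    "customer_revenue": "SELECT customer, SUM(amount) AS revenue FROM stg_orders GROUP BY customer",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount_cents INTEGER)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "ada", 1250), (2, "ada", 750), (3, "grace", 500)],
)

# Materialize each model as a view built from its SELECT statement.
for name, select_sql in models.items():
    conn.execute(f"CREATE VIEW {name} AS {select_sql}")

revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()

# A simple data test in the same spirit: no staged order has a negative amount.
negative_rows = conn.execute(
    "SELECT COUNT(*) FROM stg_orders WHERE amount < 0"
).fetchone()[0]
```

Keeping each model as a small, named query is what makes the transformations easy to review, test, and version like ordinary code.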