The Azure Data Engineer Associate Training (DP-203) focuses on data integration, transformation, and storage using Azure services. Below are the topics covered in this course.
Topics
Skill: Design and Implement Data Storage
Part 1: Implement a Partition Strategy
Session 1: Partitioning Strategies
- Implement a partition strategy for files
- Implement a partition strategy for analytical workloads
- Implement a partition strategy for streaming workloads
- Implement a partition strategy for Azure Synapse Analytics
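
The file-partitioning topics in Session 1 come down to choosing partition columns and a folder layout; below is a minimal PySpark sketch, assuming illustrative storage account, container, and column names, of writing date-partitioned Parquet files to Azure Data Lake Storage Gen2.

```python
# Minimal sketch: write files partitioned by year/month/day so that queries
# filtering on those columns can prune folders instead of scanning everything.
# Storage account, container, and column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

events = spark.read.json("abfss://raw@<storageaccount>.dfs.core.windows.net/events/")

(events
    .withColumn("year", F.year("event_time"))
    .withColumn("month", F.month("event_time"))
    .withColumn("day", F.dayofmonth("event_time"))
    .write
    .mode("overwrite")
    .partitionBy("year", "month", "day")   # one folder per partition value
    .parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/events/"))
```

The same idea carries over to analytical and streaming workloads: pick partition columns that match the most common filter predicates and keep individual partitions large enough to avoid a small-file problem.
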
Session 2: Partitioning in Azure Data Lake Storage Gen2
- Identify when partitioning is needed in Azure Data Lake Storage Gen2

Part 2: Design and Implement the Data Exploration Layer
Skill: Develop Data Processing
Part 1: Ingest and Transform Data
Session 6: Data Transformation
- Design and implement incremental data loads
- Transform data using Apache Spark, T-SQL in Azure Synapse Analytics, Azure Synapse Pipelines, Azure Data Factory, and Azure Stream Analytics
- Cleanse data, handle duplicate data, and ensure exactly-once delivery in Azure Stream Analytics
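
For the incremental-load and duplicate-handling topics in Session 6, here is a hedged PySpark sketch; the paths, watermark value, and key column are assumptions made only for illustration.

```python
# Minimal sketch of an incremental (high-water-mark) load: read only rows newer
# than the last recorded watermark, drop duplicates on the business key, and
# append to the target. All paths and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# The high-water mark would normally be persisted in a control table;
# it is hard-coded here only to keep the sketch short.
last_watermark = "2024-01-01T00:00:00"

source = (spark.read
          .parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/orders/")
          .where(f"modified_at > '{last_watermark}'")
          .dropDuplicates(["order_id"]))          # handle duplicate source rows

source.write.mode("append").parquet(
    "abfss://curated@<storageaccount>.dfs.core.windows.net/orders/")
```
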
Session 7: Data Preparation Techniques
- Handle missing and late-arriving data
- Split data, shred JSON, encode and decode data
- Configure error handling for transformations
- Normalize and denormalize data
- Perform exploratory data analysis
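
Several of the Session 7 techniques, shredding JSON and handling missing data in particular, map directly onto DataFrame operations. A minimal PySpark sketch, with the schema and column names assumed:

```python
# Minimal sketch: shred a JSON string column into typed columns and handle
# missing values. The schema and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("data-prep").getOrCreate()

payload_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = spark.read.text("abfss://raw@<storageaccount>.dfs.core.windows.net/payloads/")

prepared = (raw
    .withColumn("payload", F.from_json("value", payload_schema))   # shred JSON
    .select("payload.*")
    .fillna({"amount": 0.0})                                       # handle missing data
    .dropna(subset=["customer_id"]))

prepared.show()
```
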
Part 2: Develop a Batch Processing Solution
Session 8: Batch Processing Tools and Techniques
- Develop batch processing solutions using Azure Data Lake Storage Gen2, Azure Databricks, Azure Synapse Analytics, and Azure Data Factory
- Use PolyBase to load data into a SQL pool
- Implement Azure Synapse Link and query replicated data
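
To make the PolyBase load concrete, the sketch below drives the usual external-table-plus-CTAS pattern against a dedicated SQL pool from Python with pyodbc. The server, credentials, and object names are placeholders, and the external data source and file format are assumed to exist already.

```python
# Minimal sketch: PolyBase-style load into a dedicated SQL pool via an external
# table and CTAS. Server, credentials, and object names are placeholders; the
# external data source and file format are assumed to be created beforehand.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=<sqlpool>;"
    "UID=<user>;PWD=<password>"
)
conn.autocommit = True
cursor = conn.cursor()

cursor.execute("""
    CREATE EXTERNAL TABLE staging.OrdersExternal (
        order_id    INT,
        amount      DECIMAL(18, 2),
        modified_at DATETIME2
    )
    WITH (
        LOCATION = '/orders/',            -- folder within the external data source
        DATA_SOURCE = AzureDataLakeStore, -- assumed to exist
        FILE_FORMAT = ParquetFormat       -- assumed to exist
    );
""")

cursor.execute("""
    CREATE TABLE dbo.Orders
    WITH (DISTRIBUTION = HASH(order_id), CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM staging.OrdersExternal;
""")
```

When an external table is not needed, the COPY statement offers a simpler way to load the same files into the pool.
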
Session 9: Pipeline Development
- Create and manage data pipelines
- Scale resources and configure batch size
- Integrate Jupyter or Python notebooks into a pipeline
- Configure exception handling and retention policies
Session 10: Delta Lake Operations
- Read from and write to a Delta Lake table
- Upsert batch data and revert data to a previous state
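
The Session 10 operations translate directly to the Delta Lake Python API; a minimal sketch, assuming illustrative paths, a `customer_id` key, and a Spark environment with Delta Lake enabled (e.g. Synapse or Databricks), of an upsert followed by reverting to an earlier version:

```python
# Minimal sketch: upsert a batch into a Delta table, then go back to an earlier
# version (time travel). The paths and key column are placeholders.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-ops").getOrCreate()

delta_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/customers/"

updates = spark.read.parquet(
    "abfss://raw@<storageaccount>.dfs.core.windows.net/customer_updates/")

target = DeltaTable.forPath(spark, delta_path)

# Upsert: update matching rows, insert new ones.
(target.alias("t")
    .merge(updates.alias("u"), "t.customer_id = u.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Read an earlier table version (time travel) ...
previous = spark.read.format("delta").option("versionAsOf", 0).load(delta_path)

# ... or restore the table in place to that version.
target.restoreToVersion(0)
```
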
Part 3: Develop a Stream Processing Solution
Session 11: Stream Processing Tools
- Create stream processing solutions using Stream Analytics and Azure Event Hubs
- Process data using Spark Structured Streaming
- Create windowed aggregates and handle schema drift
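
As an illustration of the Session 11 topics, the sketch below uses Spark Structured Streaming to read from Event Hubs through its Kafka-compatible endpoint and computes a windowed aggregate. The namespace, event hub, connection string, and payload fields are placeholders.

```python
# Minimal sketch: windowed aggregate over an Event Hubs stream read through the
# Kafka-compatible endpoint. Namespace, hub, connection string, and payload
# fields are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-agg").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "<eventhub>")
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config",
               'org.apache.kafka.common.security.plain.PlainLoginModule required '
               'username="$ConnectionString" password="<connection-string>";')
       .load())

events = raw.select(
    F.get_json_object(F.col("value").cast("string"), "$.device_id").alias("device_id"),
    F.col("timestamp").alias("event_time"),
)

# 5-minute tumbling-window counts per device, tolerating 10 minutes of late data.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes"), "device_id")
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("console")
         .start())
```

The watermark bounds how long late events are waited for, which also keeps the aggregation state from growing without limit.
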
Session 12: Stream Data Processing Techniques
- Process time-series data, cross-partition data, and data within a single partition
- Configure checkpoints and watermarking
- Scale resources and optimize pipelines
Session 13: Stream Operations
- Handle interruptions and exceptions
- Replay archived stream data
- Read from and write to a Delta Lake table
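
A hedged sketch of the streaming Delta Lake pattern from Session 13, with placeholder paths: the checkpoint lets a restarted query resume after an interruption, and `startingVersion` is one way to replay data that has been archived to a Delta table.

```python
# Minimal sketch: stream from one Delta table into another. The checkpoint lets
# the query resume where it left off after an interruption; startingVersion
# replays the source table's history from the beginning. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-stream").getOrCreate()

source_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/events/"
target_path = "abfss://curated@<storageaccount>.dfs.core.windows.net/events_copy/"
checkpoint  = "abfss://curated@<storageaccount>.dfs.core.windows.net/_checkpoints/events_copy/"

stream = (spark.readStream
          .format("delta")
          .option("startingVersion", 0)   # replay from the start of the table history
          .load(source_path))

query = (stream.writeStream
         .format("delta")
         .option("checkpointLocation", checkpoint)
         .outputMode("append")
         .start(target_path))

query.awaitTermination()
```
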
Part 4: Manage Batches and Pipelines
Session 14: Pipeline Management
- Trigger and validate batch loads
- Handle failed batch loads
- Schedule and manage data pipelines in Azure Data Factory and Azure Synapse Pipelines
- Implement version control for pipeline artifacts
Session 15: Spark Job Integration
- Manage Spark jobs within a pipeline
Skill: Secure, Monitor, and Optimize Data Storage and Data Processing
Part 1: Implement Data Security
Session 16: Security Features
- Implement data masking, encryption at rest and in motion, and row-level and column-level security
- Implement Azure RBAC and POSIX-like ACLs for Data Lake Storage Gen2
- Configure secure endpoints and resource tokens in Azure Databricks
- Load DataFrames with sensitive information
- Write encrypted data to tables or Parquet files
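
For the last two Session 16 bullets, one common approach (an assumption here, not the only option) is to encrypt sensitive columns with a symmetric key before persisting them. A minimal PySpark sketch using the `cryptography` package; in practice the key would come from Azure Key Vault rather than being generated inline.

```python
# Minimal sketch: load a DataFrame with sensitive data and write it to Parquet
# with the sensitive column encrypted. Paths and column names are placeholders.
from cryptography.fernet import Fernet
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("encrypt-columns").getOrCreate()

key = Fernet.generate_key()   # placeholder: fetch from Azure Key Vault in real use

@F.udf(returnType=StringType())
def encrypt(value):
    # Recreate the cipher inside the UDF so only the key bytes ship to executors.
    if value is None:
        return None
    return Fernet(key).encrypt(value.encode("utf-8")).decode("utf-8")

customers = spark.read.parquet(
    "abfss://raw@<storageaccount>.dfs.core.windows.net/customers/")

(customers
    .withColumn("email", encrypt("email"))     # sensitive column stored encrypted
    .write.mode("overwrite")
    .parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/customers_encrypted/"))
```
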
Part 2: Monitor Data Storage and Processing
Session 17: Monitoring Services
- Implement logging for Azure Monitor
- Configure monitoring services for stream processing and pipeline performance
- Measure the performance of data movement and query execution
- Schedule and monitor pipeline tests
- Interpret Azure Monitor metrics and logs
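
A hedged sketch of interpreting Azure Monitor logs programmatically with the azure-monitor-query SDK, relevant to Session 17; the workspace ID is a placeholder, and the `ADFPipelineRun` table assumes Azure Data Factory diagnostic logs are routed to the Log Analytics workspace in resource-specific mode.

```python
# Minimal sketch: query pipeline-run logs from a Log Analytics workspace with
# the azure-monitor-query SDK. Workspace ID and the KQL query are illustrative
# and depend on the diagnostic settings actually configured.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

kql = """
ADFPipelineRun
| where Status == 'Failed'
| summarize failures = count() by PipelineName
| order by failures desc
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=kql,
    timespan=timedelta(days=1),
)

for table in response.tables:
    for row in table.rows:
        print(row)
```
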
Session 18: Alerts and Reporting
- Implement a pipeline alert strategy
Part 3: Optimize and Troubleshoot Data Storage and Processing
Session 19: Optimization Techniques
- Compact small files and handle data skew
- Optimize resource management and tune query performance by using indexers and caching
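
As a sketch of the small-file compaction covered in Session 19: on a Delta table the OPTIMIZE command bin-packs small files, and for plain Parquet folders a repartition-and-rewrite achieves a similar effect. The paths are placeholders, and OPTIMIZE assumes a Delta Lake version or Databricks runtime that supports it.

```python
# Minimal sketch: compact small files. OPTIMIZE applies to Delta tables; the
# repartition fallback works for plain Parquet folders. Paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

# Delta table: bin-pack many small files into fewer, larger ones.
spark.sql("OPTIMIZE delta.`abfss://curated@<storageaccount>.dfs.core.windows.net/events/`")

# Plain Parquet alternative: rewrite the folder with fewer, larger files.
df = spark.read.parquet("abfss://raw@<storageaccount>.dfs.core.windows.net/small_files/")
(df.repartition(16)
   .write.mode("overwrite")
   .parquet("abfss://curated@<storageaccount>.dfs.core.windows.net/compacted/"))
```
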
Session 20: Troubleshooting
- Troubleshoot failed Spark jobs and pipeline runs, including external service activities