AZ-305 Azure Solutions Architect Expert learning pathways (part 3)
The last two learning paths of the AZ-305 exam take a closer look at the Azure Well-Architected Framework (WAF) and the Cloud Adoption Framework for Azure (CAF) briefly covered in part 1 of this article.
The WAF learning pathways deep dives into WAF's 5 pillars:
Cost-optimization
Cost optimization does not necessarily translate into low cost but rather it involves building a team culture aware of budget, expenses, reporting and cost tracking. This cost management discipline should focus on creating a cost model that includes infrastructure cost, team expenditures and revenue, and matching it against a realistic budget that includes non-negotiable functional and non-functional requirements, personnel and training costs and processes that provide for anticipated growth. You then communicate upstream to app owners and stakeholders to insure visibility and buy-in:
develop a cost model -------→ Provide realistic budget
↑ |
refine with | more precise numbers |
| ↓
|←---------------- Communicate upstream
In order to design with a cost-efficiency mindset, you need to spend only what you need to achieve the highest ROI. Do not design beyond planned growth, prevent over-engineering and analyze the benefits of build vs buy. Fine-tune the design by providing services that can reduce overall cost and support cost guardrails (governance policies or or built-in app design patterns). For SLAs weigh the pros and cons of reserving budget for penalties versus using it for implementation.
When it comes to usage optimization, you need to maximize the use of resources and operations. Use consumption-based pricing if you don't expect to fully utilize pre-purchased compute. Prioritize deployment of active-active models or active-only over active-passive models, as part of your recovery plan, if you already paid for the resources. Decommission unused resources and clean up data.
In order to increase efficiency without redesigning, renegotiating, or sacrificing functional or nonfunctional requirements, you need to take advantage of opportunities to optimize the utility and costs of your existing resources and operations (rate optimization). Consolidate infrastructure by co-locating usage with other resources, workload and even teams. Achieve higher density but pay attention to the potential tradeoffs, especially on security boundaries. Commit and pre-purchase to take advantage of discounts offered on services for which costs and utilization are predictable. Use fixed-priced billing instead of consumption-based billing for a resource when its utilization is high and predictable.
It is not enough to go through these optimization processes only once. You need to continuously monitor and optimize over time. Right-size investments as your workloads evolve with the ecosystem. Optimize deployment environments. Production environments should be your main cost driver. Adjust architecture design decisions, resources, code, and workflows based on ROI data.
The following table contains tools that can help with cost-optimization:
Tool / Option | Description |
---|---|
Azure Cost Management + Billing | Monitor, allocate, and optimize Azure spending across subscriptions, departments, and resources. |
Azure Advisor | Provides personalized best practices, including cost-saving recommendations for underused resources. |
Azure Reservations | Pre-pay for 1 or 3 years of compute or other resources to save up to 72% vs pay-as-you-go prices. |
Azure Hybrid Benefit | Use existing Windows Server or SQL Server licenses with Software Assurance to save on VM costs. |
Azure Spot VMs | Use unused Azure capacity at deep discounts for interruptible, non-critical workloads. |
Azure Pricing Calculator | Estimate and model costs for Azure services before deploying. |
Azure TCO Calculator | Estimate total cost of ownership savings when migrating on-prem workloads to Azure. |
Budgets and Alerts | Set cost thresholds and receive alerts to proactively control spending. |
Management Groups & Tags | Organize and track spend across departments or projects using tagging and management group hierarchy. |
Cost Optimization Workbook | Dashboard template for visualizing and analyzing cost optimization opportunities. |
Operational Excellence
Central to the Operational Excellence pillar are DevOps methodologies that promote consistent workload quality by standardizing workflows and fostering strong team collaboration. This pillar establishes clear guidelines for development processes, monitoring, and release management. Its primary aim is to reduce variability, limit human error, and prevent disruptions for customers.
A healthy DevOps culture thrives on shared responsibility. Dev teams and ops teams are aligned in their goals and priorities, they keep business focus in mind. There's a clear line of ownership and accountability, and there's efficient collaboration through common systems & tools (like a shared backlog). Continuous improvements happen through knowledge sharing, blameless post-release, post-incident and retrospective reviews. Dev and ops procedures are codified to optimize efficiency and fast turnaround cycles.
Evolve apps with observability by emitting telemetry from app code, monitor data in team-specific dashboards, have a robust alter strategy (that prevents alert fatigue through actionable alerts).
Deploy with confidence is all about predictability and automation. Use Iac and treat it as you treat app code. Prefer declarative approaches (ARM, Terraform, etc) over imperative approaches. Automated deployments through pipelines, automate testing of components, classify assets and versions, follow a single deploy manifest. Deploy often, using small increments at regular cadence. Implement progressive exposure through feature flags.
The following table contains tools that can help with operational excellence:
Tool / Service | Description |
---|---|
Azure Monitor | End-to-end monitoring platform for collecting metrics, logs, and diagnostics from Azure resources. |
Azure Log Analytics | Query and analyze logs across services for troubleshooting, auditing, and operational insights. |
Azure Application Insights | Detect, diagnose, and analyze performance issues in live applications with distributed tracing. |
Azure Automation | Automate frequent, time-consuming, and error-prone IT management tasks with runbooks and PowerShell. |
Azure Automanage | Automatically configure best practices for Windows Server VMs, including backup and monitoring. |
Azure Policy | Define and enforce rules to ensure compliance and operational consistency across Azure environments. |
Azure Blueprints | Deploy and manage a repeatable set of Azure resources and policies to support governance at scale. Will be deprecated in 2026. |
Azure Update Manager | Manage and automate OS patching for Azure and on-premises VMs. |
Azure Service Health | Provides personalized alerts and guidance when Azure service issues affect your resources. |
Microsoft Defender for Cloud | Security posture management and threat protection across hybrid and multicloud environments. |
Azure Resource Health | Provides insights into the health of your individual Azure resources and helps diagnose availability issues. |
Azure DevOps | Provides CI/CD, version control, and Agile planning tools to streamline development and operations. |
VS Code + Bicep (IaC) | Use Bicep with VS Code to define and deploy Azure resources as code for repeatability and consistency. |
Performance Efficiency
Performance efficiency refers to how well your workload adapts to fluctuating demand. It’s essential that your systems can scale up to accommodate higher loads without affecting the user experience, and scale down to save resources when demand drops. Key to this is capacity—ensuring there’s enough CPU and memory available to meet these changing needs.
It's ideal to have realistic and well-defined performance targets created in collaboration with the business stakeholders. Use historical data to get visibility into usage patterns and bottlenecks. Bring insights from external factors, such as input from market analysis, experts and industry standards. Be explicit about what represents acceptable performance, based on investments and understand the business context (functional & non-functional requirements) and anticipated growth.
Design with a flow-centric focus by prioritizing critical areas that have the most effect on user and business outcomes. Break down the system into its parts and dependencies and understand each component's function and influence on performance. Establish a performance baseline.
Proactively measure performance, examine systems as a whole and avoid fine-tunning early. Design to meet capacity requirements by finding a balance between resource allocation and system requirements (right-size resources). Protect against performance degradation (achieve and sustain performance) while the system is in use and it evolves. Test for performance in development by integration automated performance tests (formal quality gates) into the build pipelines.
Optimize through observability. Monitor real transactions in production and deviations against performance targets. Use synthetic transaction testing to generate a more consistent performance baseline. Handle workload changes intelligently by addressing performance erosion as usage increases, systems evolve and data accumulates. Reset expectations and establish new targets, if fine-tuning brings only short-term benefits.
Improve efficiency through optimization by having dedicated cycles for performance optimization, enhance architecture with design patterns and components which can boost performance, use application performance monitoring tools and profilers to analyze historic trends. Stay current with tech innovations.
The following table contains tools that can help with performance efficiency:
Tool / Service | Description |
---|---|
Azure Application Insights | APM service for monitoring live applications, with telemetry, dependency tracking, and usage analytics. |
Azure Monitor | Platform-wide monitoring for collecting metrics, logs, and custom performance counters across resources. |
Azure Monitor Profiler (Snapshot Debugger) | Captures runtime performance traces of live apps without impacting users—helps diagnose CPU/memory issues. |
Visual Studio Profiler | Development-time tool for deep performance profiling, including CPU sampling, memory allocation, and .NET performance. |
Azure Load Testing | Simulates real-world traffic to evaluate and improve app scalability, responsiveness, and reliability. |
Azure Advisor (Performance) | Recommends VM resizing, SKU changes, and performance optimizations for Azure resources. |
Azure Metrics Explorer | Visualize and correlate performance metrics across services with custom dashboards and workbooks. |
Azure Service Profiler (Classic) | Legacy profiler used for tracking latency and performance hotspots in .NET web apps. |
Application Map (App Insights) | Automatically detects app components and their dependencies to visualize performance and latency bottlenecks. |
Reliability
Workload architectures should have reliability assurances in application code, infrastructure, and operation. A reliable workflow is:
- resilient: it can detect, withstand, and recover from failures within an acceptable time period.
- available: users can access the workload during the promised time period at the promised quality level. The workload should be prepared to anticipate or handle problems in production and to avoid service disruption that might lead to financial loss.
Design for:
- business requirements: Focus on the intended utility of the workflow. Clear requirements need to state clear expectations that are quantified by metrics that help you understand if the costs of implementation are within the investment limit. Set targets on individual components, system and user flows, and system as a whole. Document all dependencies and their effect on resiliency. Understand that SLAs vary by service and not all services and features are covered equally.
- resilience: Build resilience into the system so that it's fault-tolerant and can degrade gracefully. Scale-out critical path components. Identify potential points of failure, especially for critical components, and determine the effect on user and system flows. Build self-preservation mechanisms and modularize the design to isolate faults. Build redundancy in layers and resilience in application tiers. Over-provision to immediately mitigate individual failures.
- recovery: Highly resilient systems have disaster preparedness approaches in both architecture design and workload operations. Run regular recovery drills that test the process of recovering system components, data and failover and failback steps. Ensure that you can repair data by using backups. Automate self-healing capabilities to reduce risks from external factors like human intervention. Replace stateless components with immutable ephemeral units (spin up and destroy on demand - repeatability and consistency).
- operations: Test failures early and often (shift left) in the development lifecycle. Insights, diagnostics, and alerts from observable systems are fundamental to effective incident management and continuous improvement. Build observable systems that can correlate telemetry. Simulate failures and run tests in production and pre-production environments.
Keep it simple by avoiding overengineering the architecture design, application code, and operations. Keep the critical path lean. Establish standards all across the SDLC. Develop just enough code. Rely on tried and tested practices that have been used with similar workloads to minimize ops and dev burden.
The following table contains tools that can help with reliability:
Tool / Service | Description |
---|---|
Azure Availability Zones | Physically separate datacenters within a region that increase resilience and fault isolation. |
Azure Load Balancer | Distributes incoming network traffic to ensure high availability and redundancy. |
Azure Traffic Manager | DNS-based traffic routing across multiple regions for geographic failover and load distribution. |
Azure Front Door | Global entry point that provides traffic routing, failover, and health probing for high availability. |
Azure Site Recovery | Disaster recovery as a service for replicating and recovering workloads across regions or datacenters. |
Azure Backup | Simple and secure backup solution for Azure VMs, workloads, and on-prem resources. |
Azure Monitor | Comprehensive monitoring solution that helps detect failures and anomalies before they become critical. |
Azure Resource Health | Provides information on the current and historical health of Azure resources for proactive response. |
Azure Service Health | Alerts and insights into Azure outages and planned maintenance that may impact your resources. |
Azure Advisor (Reliability) | Recommends actions to improve reliability, such as enabling zone redundancy or configuring backup. |
Availability Sets | Logical grouping of VMs to prevent simultaneous downtime due to maintenance or hardware failures. |
Auto-Scale | Automatically adjusts resources based on load to ensure consistent performance and reliability. |
Security
Security is not an afterthought. You should strive to adopt and implement security practices in architectural design decisions and operations with minimal friction. Create a security readiness plan that's aligned with business priorities and which takes into account all facets of Zero Trust and the CIA Triad. Your security strategy needs to include plans for reliability, health modeling, self-preservation, defense against workload intrusion and exfiltration attacks.
Optimize security through segmentation. Establish security boundaries driven by business requirements. Have clear lines of responsibilities and roles. Isolation enables you to limit exposure of sensitive flows to only roles and assets that need access. Use industry frameworks to define a well-documented incident response plan. Define and enforce team-level security standards and strive for consistent practices.
Protect:
- confidentiality: limit access; classify data (confidentiality level, type, sensitivity, potential risk); encrypt at all stages (rest, transit, processing); block unauthorized data transfer; implement audit trails.
- integrity: defend the supply chain by threat scanning; AuthN and AuthZ access control; protect against vulnerabilities; establish trust by using cryptographic techniques; make sure back-up is encrypted and immutable; avoid/mitigate system implementations that allow your workload to operate outside its intended limits and purposes.
- availability: access to data should be done within the allowed access scope (JIT / JEA); implement security controls to protect against resource exhaustion (DDoS controls); preventative measures for attack vectors that exploit vulnerabilities; prioritize security controls on critical components and flows; apply the same security level of rigor in your recovery resources as you do in your primary environment.
Sustain and evolve your security posture. Perform threat modeling to identify and mitigate potential threats. Follow industry standard methodologies. Independently verify your controls by performing routine and integrated vulnerability scanning to detect exploits in infrastructure, dependencies and application code. Ethically hack the system. Stay current on updates, patching and security fixes. Use threat intelligence powered by security analytics for dynamic detection of threats.
The following table contains tools that can help with security:
Tool / Service | Description |
---|---|
Microsoft Defender for Cloud | Provides unified security management and threat protection across Azure, hybrid, and multicloud environments. |
Azure Key Vault | Securely stores secrets, encryption keys, and certificates with access policies and logging. |
Azure Firewall | Cloud-native network security service with threat intelligence filtering, FQDN filtering, and logging. |
Web Application Firewall (WAF) | Protects web applications from common exploits and vulnerabilities like SQL injection and XSS. |
Azure DDoS Protection | Protects your applications from volumetric distributed denial of service attacks. |
Azure Policy | Enforces organizational standards and compliance at scale across Azure resources. |
Microsoft Entra ID (formerly Azure AD) | Identity and access management service with support for SSO, MFA, conditional access, and RBAC. |
Azure Bastion | Provides secure and seamless RDP/SSH access to VMs without exposing public IP addresses. |
Azure Private Link | Enables private access to Azure services over the Microsoft backbone network, avoiding the public internet. |
Network Security Groups (NSGs) | Controls traffic flow at subnet and NIC levels using allow/deny rules for IPs, ports, and protocols. |
Microsoft Purview | Data governance, discovery, classification, and protection across your data estate. |
CAF is covered in part 4 of this article.