
05_05_02_dependency_analysis - Execution Order & Scheduling

Purpose

Determines an efficient execution order for topology nodes, identifies parallel execution opportunities, and finds computational bottlenecks (the critical path). This analysis is essential for generating efficient code and maximizing CPU utilization.

Key Concepts

Topological Sorting

Definition: Linear ordering of nodes where every node appears before its successors.

Valid Order: For every edge A → B, node A appears before node B in the sequence.

Example:

Graph:  A → C → E
        B → D ↗

Valid orders:
  [A, B, C, D, E]
  [B, A, D, C, E]
  [A, C, B, D, E]

Invalid:
  [A, C, E, B, D]  ❌ (E appears before D, but the edge D → E requires D first)
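
The validity condition is mechanical to check. Below is a small illustrative sketch (not part of the DependencyAnalyzer API; the NodeID alias and Edge struct are stand-ins for the real topology types):

#include <string>
#include <unordered_map>
#include <vector>

using NodeID = std::string;
struct Edge { NodeID from, to; };

bool isValidOrder(const std::vector<NodeID>& order,
                  const std::vector<Edge>& edges) {
    // Map each node to its position in the candidate ordering.
    std::unordered_map<NodeID, size_t> position;
    for (size_t i = 0; i < order.size(); ++i) position[order[i]] = i;

    // Every edge must point "forward" in the ordering.
    for (const auto& e : edges) {
        if (position.at(e.from) >= position.at(e.to)) return false;
    }
    return true;
}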

Kahn's Algorithm

Purpose: Topological sort via iterative in-degree reduction

Algorithm:

1. Count in-degrees (incoming edges) for all nodes
2. Queue all nodes with in-degree = 0 (sources)
3. While queue not empty:
   a. Dequeue node and add to result
   b. For each successor:
      - Decrease its in-degree by 1
      - If in-degree becomes 0, enqueue it
4. If all nodes processed → success, else cycle exists

Complexity: O(V + E) where V=nodes, E=edges
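
A minimal, self-contained sketch of the algorithm above, using a plain adjacency-list graph rather than the library's topology type (the NodeID and Graph aliases here are illustrative):

#include <queue>
#include <string>
#include <unordered_map>
#include <vector>

using NodeID = std::string;
using Graph  = std::unordered_map<NodeID, std::vector<NodeID>>;  // node -> successors

std::vector<NodeID> kahnSort(const Graph& graph) {
    // Step 1: count in-degrees for every node (sinks included).
    std::unordered_map<NodeID, size_t> in_degree;
    for (const auto& [node, succs] : graph) {
        in_degree.emplace(node, 0);
        for (const auto& s : succs) ++in_degree[s];
    }

    // Step 2: queue all sources (in-degree 0).
    std::queue<NodeID> ready;
    for (const auto& [node, deg] : in_degree)
        if (deg == 0) ready.push(node);

    // Step 3: repeatedly emit a ready node and release its successors.
    std::vector<NodeID> order;
    while (!ready.empty()) {
        NodeID n = ready.front(); ready.pop();
        order.push_back(n);
        auto it = graph.find(n);
        if (it == graph.end()) continue;          // sink node: no successors
        for (const auto& succ : it->second)
            if (--in_degree[succ] == 0) ready.push(succ);
    }

    // Step 4: if not all nodes were emitted, the graph contains a cycle.
    if (order.size() != in_degree.size()) order.clear();
    return order;
}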

Parallel Stages

Definition: Nodes at the same "level" can execute simultaneously

Level Assignment:
  • Level 0: Source nodes (no predecessors)
  • Level N: max(predecessor levels) + 1

Example:

    A (L0) ──→ B (L1) ──→ D (L2) ──→ E (L3)
       │                              ↑
       └─────→ C (L1) ────────────────┘

Parallel stages:
  Stage 0: [A]           (1 node)
  Stage 1: [B, C]        (2 nodes - parallel!)
  Stage 2: [D]           (1 node)
  Stage 3: [E]           (1 node)

Max parallelism: 2
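
The level rule maps directly onto a single pass over a topological order. An illustrative sketch, reusing the NodeID and Graph aliases from the Kahn sketch above:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<NodeID, size_t> computeLevels(
        const Graph& graph, const std::vector<NodeID>& topo_order) {
    // Invert the graph so each node knows its predecessors.
    std::unordered_map<NodeID, std::vector<NodeID>> preds;
    for (const auto& [node, succs] : graph)
        for (const auto& s : succs) preds[s].push_back(node);

    std::unordered_map<NodeID, size_t> level;
    for (const auto& node : topo_order) {   // predecessors are finalized first
        size_t lvl = 0;                     // sources stay at level 0
        for (const auto& p : preds[node])
            lvl = std::max(lvl, level[p] + 1);
        level[node] = lvl;
    }
    return level;
}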

Critical Path

Definition: Longest path in terms of computational cost

Significance:
  • Determines the minimum possible execution time
  • Primary bottleneck for optimization
  • Adding parallelism cannot reduce execution time below its cost

Calculation:

For each node N, visited in topological order:
  cost_to[N] = max over predecessors P of cost_to[P]  +  cost[N]
  (for a source node, cost_to[N] = cost[N])

Critical path = backtrack from the node with the maximum cost_to
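
A sketch of this longest-path dynamic program, again on the illustrative Graph/NodeID aliases rather than the library's topology type; the cost map stands in for whatever cost model is active:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

struct PathResult {
    std::vector<NodeID> path;   // heaviest path, in execution order
    size_t total_cost = 0;
};

PathResult heaviestPath(const Graph& graph,
                        const std::vector<NodeID>& topo_order,
                        const std::unordered_map<NodeID, size_t>& cost) {
    PathResult result;
    if (topo_order.empty()) return result;

    // Invert edges so each node knows its predecessors.
    std::unordered_map<NodeID, std::vector<NodeID>> preds;
    for (const auto& [node, succs] : graph)
        for (const auto& s : succs) preds[s].push_back(node);

    std::unordered_map<NodeID, size_t> cost_to;   // heaviest path cost ending at node
    std::unordered_map<NodeID, NodeID> best_pred; // predecessor on that path
    NodeID end_node = topo_order.front();

    for (const auto& node : topo_order) {
        size_t incoming = 0;
        for (const auto& p : preds[node])
            if (cost_to[p] >= incoming) { incoming = cost_to[p]; best_pred[node] = p; }
        cost_to[node] = incoming + cost.at(node);
        if (cost_to[node] >= result.total_cost) {
            result.total_cost = cost_to[node];
            end_node = node;
        }
    }

    // Backtrack from the heaviest endpoint to recover the path.
    for (NodeID n = end_node; ; n = best_pred.at(n)) {
        result.path.push_back(n);
        if (best_pred.find(n) == best_pred.end()) break;   // reached a source node
    }
    std::reverse(result.path.begin(), result.path.end());
    return result;
}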

API Overview

Basic Analysis

#include "dependency_analyzer.hpp"

// Analyze with default options
auto analysis = DependencyAnalyzer::analyze(topology);

std::cout << "Execution order:\n";
for (const auto& node_id : analysis.execution_order.linear_order) {
    std::cout << "  → " << node_id << "\n";
}

std::cout << "\nParallel stages: " << analysis.execution_order.total_stages << "\n";
std::cout << "Max parallelism: " << analysis.max_parallelism << " nodes\n";
std::cout << "Avg parallelism: " << analysis.average_parallelism << " nodes/stage\n";
std::cout << "Sequential bottlenecks: " << analysis.sequential_bottlenecks << "\n";

Custom Scheduling

SchedulingOptions options;
options.strategy = SchedulingStrategy::MaximizeParallel;
options.cost_model = CostModel::ByType;
options.num_threads = 4;
options.allow_reordering = true;

auto analysis = DependencyAnalyzer::analyze(topology, options);

Parallel Stages

auto stages = DependencyAnalyzer::identifyParallelStages(topology);

for (const auto& stage : stages) {
    std::cout << "Stage " << stage.stage_index << " (cost=" << stage.estimated_cost << "):\n";
    for (const auto& node : stage.nodes) {
        std::cout << "  - " << node << "\n";
    }
}

Critical Path

auto critical_path = DependencyAnalyzer::findCriticalPath(topology, CostModel::ByType);

std::cout << "Critical path (cost=" << critical_path.total_cost << "):\n";
for (const auto& node : critical_path.path) {
    size_t cost = critical_path.node_costs[node];
    std::cout << "  → " << node << " (cost=" << cost << ")\n";
}

Level Calculation

auto levels = DependencyAnalyzer::calculateLevels(topology);

for (const auto& [node_id, level] : levels) {
    std::cout << node_id << " at level " << level << "\n";
}

Data Structures

ExecutionOrder

struct ExecutionOrder {
    std::vector<NodeID> linear_order;      // Sequential execution order
    std::vector<ParallelStage> stages;     // Parallel stages
    size_t total_stages;                   // Number of stages
    bool is_valid;                         // Success flag
    std::string error_message;             // Error if invalid
};

ParallelStage

struct ParallelStage {
    size_t stage_index;                    // Stage number
    std::vector<NodeID> nodes;             // Nodes in this stage
    size_t estimated_cost;                 // Total cost
    std::string description;               // Human-readable
};

CriticalPathInfo

struct CriticalPathInfo {
    std::vector<NodeID> path;                              // Critical path nodes
    size_t total_cost;                                     // Total cost
    size_t total_latency;                                  // Total latency
    std::unordered_map<NodeID, size_t> node_costs;        // Per-node breakdown
    std::unordered_map<NodeID, size_t> node_latencies;    // Per-node latency
};

DependencyAnalysis

struct DependencyAnalysis {
    ExecutionOrder execution_order;        // Ordering info
    CriticalPathInfo critical_path;        // Bottleneck info
    size_t max_parallelism;                // Max concurrent nodes
    float average_parallelism;             // Avg nodes per stage
    size_t sequential_bottlenecks;         // Single-node stages
};
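
For reference, the summary metrics can be read as simple aggregates over the stages; the following sketch assumes only the field semantics documented above and reuses the `analysis` variable from the earlier examples:

// Sketch (not library code): deriving the summary metrics from the stages.
size_t total_nodes = 0, max_par = 0, bottlenecks = 0;
for (const auto& stage : analysis.execution_order.stages) {
    total_nodes += stage.nodes.size();
    if (stage.nodes.size() > max_par) max_par = stage.nodes.size();
    if (stage.nodes.size() == 1) ++bottlenecks;          // single-node stage
}
float avg_par = analysis.execution_order.total_stages > 0
    ? static_cast<float>(total_nodes) / analysis.execution_order.total_stages
    : 0.0f;
// Expected to match: analysis.max_parallelism, analysis.average_parallelism,
// analysis.sequential_bottlenecks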

Scheduling Strategies

1. TopologicalSort (Default)

options.strategy = SchedulingStrategy::TopologicalSort;
  • Simple Kahn's algorithm
  • Minimal scheduling overhead
  • Good for linear/sequential graphs

2. MaximizeParallel

options.strategy = SchedulingStrategy::MaximizeParallel;
  • Prioritizes extracting parallelism
  • Groups independent nodes aggressively
  • Best for multi-threaded execution

3. MinimizeLatency

options.strategy = SchedulingStrategy::MinimizeLatency;
  • Prioritizes the critical path
  • Executes high-latency nodes first
  • Best for real-time constraints

4. LoadBalanced

options.strategy = SchedulingStrategy::LoadBalanced;
options.num_threads = 4;
  • Balances work across threads
  • Merges small stages, splits large ones
  • Best for sustained throughput
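
One plausible way such balancing can work is a greedy longest-processing-time assignment within each stage. The sketch below is illustrative only; balanceStage and the node_cost map are hypothetical helpers, not part of the library API:

#include <algorithm>
#include <unordered_map>
#include <vector>

std::vector<std::vector<NodeID>> balanceStage(
        const ParallelStage& stage,
        size_t num_threads,
        const std::unordered_map<NodeID, size_t>& node_cost) {
    // Biggest jobs first, so they anchor the per-thread bins.
    std::vector<NodeID> nodes = stage.nodes;
    std::sort(nodes.begin(), nodes.end(),
              [&](const NodeID& a, const NodeID& b) {
                  return node_cost.at(a) > node_cost.at(b);
              });

    std::vector<std::vector<NodeID>> bins(num_threads);
    std::vector<size_t> load(num_threads, 0);
    for (const auto& n : nodes) {
        // Greedily place each node on the currently least-loaded thread.
        size_t t = std::min_element(load.begin(), load.end()) - load.begin();
        bins[t].push_back(n);
        load[t] += node_cost.at(n);
    }
    return bins;
}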

Cost Models

Uniform Cost

options.cost_model = CostModel::Uniform;
All nodes cost 100 units (default)

By Type

options.cost_model = CostModel::ByType;
Heuristic cost based on kernel type:
  • FFT: 1000
  • Convolution: 500
  • Filter: 200
  • Multiply: 50
  • Add: 20
  • Delay: 10
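
Conceptually this amounts to a lookup table keyed by kernel type, as in the illustrative sketch below (the string keys and the costByType helper are assumptions, not the library's actual representation):

#include <string>
#include <unordered_map>

size_t costByType(const std::string& kernel_type) {
    static const std::unordered_map<std::string, size_t> kCosts = {
        {"fft", 1000}, {"convolution", 500}, {"filter", 200},
        {"multiply", 50}, {"add", 20}, {"delay", 10},
    };
    auto it = kCosts.find(kernel_type);
    return it != kCosts.end() ? it->second : 100;  // fall back to the uniform default
}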

Measured

options.cost_model = CostModel::Measured;
Uses actual profiling data (future implementation)

Examples

Example 1: Simple Chain

// Topology: A → B → C → D
auto topology = TopologyBuilder()
    .addNode("A", "input", NodeType::Source)
    .addNode("B", "filter", NodeType::Processing)
    .addNode("C", "gain", NodeType::Processing)
    .addNode("D", "output", NodeType::Sink)
    .connect("A", "out", "B", "in")
    .connect("B", "out", "C", "in")
    .connect("C", "out", "D", "in")
    .build();

auto analysis = DependencyAnalyzer::analyze(topology);

// Result:
// linear_order: [A, B, C, D]
// stages: [{A}, {B}, {C}, {D}]
// max_parallelism: 1
// sequential_bottlenecks: 4

Example 2: Parallel Branches

// Topology:     ┌→ B → D ─┐
//           A ──┤         ├→ F
//               └→ C → E ─┘
auto topology = TopologyBuilder()
    .addNode("A", "input", NodeType::Source)
    .addNode("B", "filter1", NodeType::Processing)
    .addNode("C", "filter2", NodeType::Processing)
    .addNode("D", "gain1", NodeType::Processing)
    .addNode("E", "gain2", NodeType::Processing)
    .addNode("F", "mix", NodeType::Processing)
    .connect("A", "out", "B", "in")
    .connect("A", "out", "C", "in")
    .connect("B", "out", "D", "in")
    .connect("C", "out", "E", "in")
    .connect("D", "out", "F", "in1")
    .connect("E", "out", "F", "in2")
    .build();

auto analysis = DependencyAnalyzer::analyze(topology);

// Result:
// linear_order: [A, B, C, D, E, F] (or [A, C, B, E, D, F])
// stages: [{A}, {B, C}, {D, E}, {F}]
// max_parallelism: 2
// sequential_bottlenecks: 2 (A and F)

Example 3: Complex DAG with Critical Path

// Build complex topology
auto topology = buildComplexFilterBank();  // 20+ nodes

// Find critical path with cost model
auto critical = DependencyAnalyzer::findCriticalPath(
    topology,
    CostModel::ByType
);

std::cout << "⚠️  Critical path has " << critical.path.size() << " nodes\n";
std::cout << "Total cost: " << critical.total_cost << "\n";

// Identify optimization targets
for (const auto& node_id : critical.path) {
    size_t cost = critical.node_costs[node_id];
    if (cost > 500) {  // Expensive node
        std::cout << "🎯 Optimize: " << node_id
                  << " (cost=" << cost << ")\n";
    }
}

Example 4: Multi-threaded Scheduling

SchedulingOptions options;
options.strategy = SchedulingStrategy::LoadBalanced;
options.num_threads = 4;
options.cost_model = CostModel::ByType;

auto analysis = DependencyAnalyzer::analyze(topology, options);

// Execute stage by stage: nodes within a stage are independent, so they
// can run in parallel, while the implicit barrier at the end of each
// parallel loop preserves dependency order between stages.
for (const auto& stage : analysis.execution_order.stages) {
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < static_cast<int>(stage.nodes.size()); ++i) {
        execute_node(stage.nodes[i]);
    }
}

Performance

Operation              Complexity   Notes
Topological sort       O(V + E)     Kahn's algorithm
Level calculation      O(V + E)     BFS-based
Critical path          O(V + E)     DP with backtracking
Stage identification   O(V)         After level calculation
Load balancing         O(S)         S = number of stages

Memory: O(V) for auxiliary structures

Typical performance:
  • 100 nodes: <1 ms
  • 1,000 nodes: <10 ms
  • 10,000 nodes: <100 ms

Integration with Other Systems

Input Dependencies

  • Topology graph from 00_graph_representation
  • Causality info from 01_causality_validation (for latency)

Output Consumers

  • Buffer management (03_buffer_management) uses execution order for lifetime analysis
  • Code generation (06_code_generation) uses stages for loop structure
  • Optimization layer uses critical path for optimization priorities

Bottleneck Detection

// Find sequential choke points
auto bottlenecks = DependencyAnalyzer::detectBottlenecks(analysis);

std::cout << "Bottlenecks at stages: ";
for (auto stage_idx : bottlenecks) {
    std::cout << stage_idx << " ";
}
std::cout << "\n";

// Suggest improvements
if (!bottlenecks.empty()) {
    std::cout << "💡 Suggestions:\n";
    for (auto idx : bottlenecks) {
        auto& stage = analysis.execution_order.stages[idx];
        std::cout << "  - Stage " << idx << ": "
                  << stage.nodes[0] << " blocks parallel execution\n";
        std::cout << "    Consider: algorithm redesign or reordering\n";
    }
}

Debugging Output

void printAnalysis(const DependencyAnalysis& analysis) {
    std::cout << "=== DEPENDENCY ANALYSIS ===\n\n";

    std::cout << "Linear Execution Order:\n";
    for (size_t i = 0; i < analysis.execution_order.linear_order.size(); ++i) {
        std::cout << "  " << i << ": " << analysis.execution_order.linear_order[i] << "\n";
    }

    std::cout << "\nParallel Stages:\n";
    for (const auto& stage : analysis.execution_order.stages) {
        std::cout << "  Stage " << stage.stage_index
                  << " (cost=" << stage.estimated_cost
                  << ", nodes=" << stage.nodes.size() << "):\n";
        for (const auto& node : stage.nodes) {
            std::cout << "    - " << node << "\n";
        }
    }

    std::cout << "\nCritical Path (cost=" << analysis.critical_path.total_cost << "):\n";
    for (const auto& node : analysis.critical_path.path) {
        size_t cost = analysis.critical_path.node_costs.at(node);
        std::cout << "  → " << node << " (" << cost << ")\n";
    }

    std::cout << "\nMetrics:\n";
    std::cout << "  Max parallelism: " << analysis.max_parallelism << " nodes\n";
    std::cout << "  Avg parallelism: " << analysis.average_parallelism << " nodes/stage\n";
    std::cout << "  Bottlenecks: " << analysis.sequential_bottlenecks << " stages\n";
}

Next Steps

After dependency analysis:

1. 03_buffer_management - Allocate buffers based on execution order
2. 06_code_generation - Generate loops respecting dependencies
3. Optimization layer - Focus on critical-path nodes


Status: ✅ Core implementation complete
Test Coverage: ~90%
Performance: O(V + E) for all operations