
COMPLETE DEVELOPMENT PLAN - 05_06_OPTIMIZATION_LAYER

THEORETICAL AND PRACTICAL FRAMEWORK

Fundamental Concepts

  • SIMD (Single Instruction Multiple Data): Data-level parallelism - one instruction operates on a vector of multiple elements
  • Vectorization: Transforming scalar loops into code that processes multiple samples simultaneously
  • Loop Optimization: Classic transformations (unrolling, fusion, interchange, tiling) to reduce overhead and improve cache locality
  • Function Inlining: Eliminating call overhead by replacing calls with the function body
  • Cache Hierarchy: L1 (~1-3 cycles, 32KB) → L2 (~10-20 cycles, 256KB) → L3 (~40-70 cycles, 8MB) → RAM (~100-300 cycles)
  • Branch Prediction: The CPU predicts branch direction - a misprediction costs 10-20 cycles
  • Profile-Guided Optimization (PGO): Using data from real executions to guide optimization decisions

Specific Algorithms

  • DFS 3-Color Algorithm: For cycle detection in dependency graphs (WHITE/GRAY/BLACK)
  • Topological Sort: Kahn's algorithm for an ordering that respects dependencies (see the sketch after this list)
  • Common Subexpression Elimination: Identify duplicated expressions and compute them only once
  • Constant Folding: Pre-computing constant expressions at compile time
  • Strength Reduction: Replacing expensive operations with cheaper equivalents (e.g., x*2 → x<<1)
  • Auto-tuning: Search algorithms to find optimal parameters for the target hardware
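
To make the dependency-ordering step concrete, here is a minimal C++ sketch of Kahn's algorithm, assuming the dependency graph is given as an adjacency list of integer node IDs; the function name and types are illustrative, not part of this plan's actual API.

```cpp
#include <cstddef>
#include <optional>
#include <queue>
#include <vector>

// adj[u] lists the nodes that depend on u (edges u -> v).
std::optional<std::vector<int>> topological_order(const std::vector<std::vector<int>>& adj) {
    std::vector<int> in_degree(adj.size(), 0);
    for (const auto& edges : adj)
        for (int v : edges) ++in_degree[v];

    std::queue<int> ready;                          // nodes whose dependencies are all satisfied
    for (std::size_t u = 0; u < adj.size(); ++u)
        if (in_degree[u] == 0) ready.push(static_cast<int>(u));

    std::vector<int> order;
    while (!ready.empty()) {
        int u = ready.front(); ready.pop();
        order.push_back(u);
        for (int v : adj[u])
            if (--in_degree[v] == 0) ready.push(v);
    }
    if (order.size() != adj.size()) return std::nullopt;   // leftover nodes => a cycle exists
    return order;
}
```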

Architectural Patterns

  • Strategy Pattern: Multiple implementations of the same operation (SSE, AVX, NEON, generic)
  • Template Method: Optimization skeleton with specializable steps
  • Decorator Pattern: Wrapping functions with instrumentation for profiling
  • Factory Pattern: Selecting the optimized implementation for the runtime platform
  • Observer Pattern: Notification of optimization events for logging/monitoring

Quality Metrics

  • Speedup: 5-10x vs naive implementation (Phase 1: 2-4x, Phase 2: 5-10x, Phase 3: 50-100x with GPU)
  • Vectorization Coverage: >80% of critical loops vectorized
  • Cache Hit Rate: >95% L1 cache hits in hot loops
  • Branch Prediction Accuracy: >98% in optimized code
  • Multi-core Efficiency: 80%+ (N cores → 0.8N speedup)
  • Test Coverage: >90%
  • Correctness: 100% of regression tests passing
  • Performance Consistency: <5% variation across repeated benchmarks

PRIORITIZATION AND DEPENDENCIES

Execution Order (Based on Technical Dependencies)

TIER 1 - Foundation (no external dependencies):
  • TASK 11: regression_testing (required before any optimization)
  • TASK 9: platform_specific (detect hardware capabilities)

TIER 2 - Core Optimizations (depends on TIER 1):
  • TASK 1: vectorization_engine (foundation of DSP performance)
  • TASK 5: expression_optimization (algebraic simplifications)
  • TASK 4: memory_optimization (cache awareness)

TIER 3 - Advanced Optimizations (depends on TIER 2):
  • TASK 2: loop_optimization (uses vectorization + memory)
  • TASK 6: branch_optimization (uses expression optimization)
  • TASK 3: function_optimization (uses loop + expression)

TIER 4 - Parallel Execution (depends on TIER 3):
  • TASK 7: parallel_execution (multi-threading)
  • TASK 8: gpu_acceleration (massive offloading)

TIER 5 - Intelligent Optimization (depends on all of the above):
  • TASK 10: profile_guided_optimization (uses real execution data)

TIER 6 - Integration & Finalization:
  • TASK FINAL-A: Integration Testing
  • TASK FINAL-B: System Integration
  • TASK FINAL-C: Documentation Package


DETAILED TASKS

TASK 1: Vectorization Engine - The SIMD Parallelizer

Folder: 05_06_00_vectorization_engine
Criticality: ⭐⭐⭐⭐⭐ (Maximum - 4-8x performance impact)
Priority: TIER 2

DEVELOPMENT:

  1. Core Implementation
  2. Vectorizability Analyzer:
    • Loop dependency analysis (detect dependencies between iterations)
    • Memory alignment checker (verify 16/32-byte alignment)
    • Branch complexity evaluator (identify if statements that block vectorization)
    • Trip count analyzer (determine whether the loop count is known)
  3. SIMD Code Generator:
    • SSE intrinsics wrapper (4 floats at a time)
    • AVX intrinsics wrapper (8 floats at a time)
    • AVX-512 intrinsics wrapper (16 floats at a time)
    • ARM NEON intrinsics wrapper (4 floats at a time)
    • Remainder loop handler (process the last N % vector_width samples; see the sketch below)
  4. Data Layout Transformer:
    • AoS → SoA conversion (Array-of-Structures to Structure-of-Arrays)
    • Interleave/deinterleave operations for stereo/multi-channel
    • Alignment enforcement (insert padding automatically)
  5. Auto-vectorization Controller:
    • Compiler flag configuration (-O3, -ftree-vectorize, -march=native)
    • Vectorization report parser (analyze -fopt-info-vec output)
    • Fallback selector (choose between manual intrinsics and auto-vectorization)
  6. Vector Width Selector:
    • Runtime CPU capability detection (cpuid, _mm_getcsr)
    • Optimal vector width calculator per workload
    • Dynamic dispatch to the appropriate implementation

  7. Testing Framework

  8. Unit tests: Each intrinsic wrapper against known data
  9. Vectorization coverage tests: Verify that critical loops vectorize
  10. Correctness tests: Compare vectorized vs scalar output (epsilon <1e-6)
  11. Alignment tests: Verify that data is correctly aligned
  12. Performance tests: Benchmark vectorized vs scalar (expected 4-8x)
  13. Remainder handling tests: Verify that N not a multiple of vector_width works
  14. Cross-platform tests: The same code on SSE, AVX, NEON produces the same result
  15. Regression tests: Suite of audio samples processed bit-exact
  16. Test coverage >90%

  17. Documentation

  18. SIMD fundamentals guide (what SIMD is, why it matters)
  19. Vectorization best practices (how to write vectorizable code)
  20. Intrinsics reference (every wrapper documented)
  21. Platform comparison (SSE vs AVX vs NEON capabilities)
  22. Troubleshooting vectorization failures (why a loop did not vectorize)
  23. Performance tuning guide (optimizing for different vector widths)

  24. Interfaces and Connections

  25. VectorizationEngine.analyze_loop(loop_ast) → VectorizabilityReport
  26. VectorizationEngine.generate_simd_code(loop, target_isa) → SIMD code
  27. VectorizationEngine.transform_layout(data, AoS_to_SoA) → Transformed data
  28. Event: on_vectorization_success(loop_id, speedup)
  29. Event: on_vectorization_failure(loop_id, reason)
  30. Symlink: kernel_implementations/ → ../04_KERNELS_L0/
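
As a reference for the SIMD code generator and remainder loop handler above, a minimal sketch of the kind of AVX kernel the generator would emit (compiled with AVX enabled); the function name and the use of unaligned loads are illustrative simplifications, not the engine's real output.

```cpp
#include <cstddef>
#include <immintrin.h>

// Scale a buffer by a constant gain: 8 floats per AVX iteration, scalar tail.
void apply_gain_avx(float* buffer, std::size_t n, float gain) {
    const __m256 vgain = _mm256_set1_ps(gain);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {                            // vector body: 8 samples per iteration
        __m256 v = _mm256_loadu_ps(buffer + i);             // unaligned load keeps the sketch simple
        _mm256_storeu_ps(buffer + i, _mm256_mul_ps(v, vgain));
    }
    for (; i < n; ++i)                                      // remainder loop: last n % 8 samples
        buffer[i] *= gain;
}
```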

DELIVERABLES:
- [ ] Functional vectorization analyzer (detects vectorizable loops)
- [ ] SIMD code generators for SSE/AVX/NEON
- [ ] Data layout transformer (AoS ↔ SoA)
- [ ] Runtime dispatch system
- [ ] Test suite with >90% coverage
- [ ] Complete documentation with examples
- [ ] Benchmarks showing 4-8x speedup on critical kernels

ESTIMATE: 3 weeks


TASK 2: Loop Optimization - The Iteration Transformer

Folder: 05_06_01_loop_optimization
Criticality: ⭐⭐⭐⭐ (High)
Priority: TIER 3

DEVELOPMENT:

  1. Core Implementation
  2. Loop Unrolling Engine:
    • Unroll factor calculator (2x, 4x, 8x depending on cache/register pressure)
    • Code generator for unrolled loops
    • Remainder loop handler
    • Cost model (code size vs performance trade-off)
  3. Loop Fusion System:
    • Dependency analyzer (identify fusable loops; see the sketch below)
    • Fusion transformer (combine loops over the same data)
    • Register pressure estimator (avoid spilling)
    • Validation checker (verify that fusion is legal)
  4. Loop Interchange Optimizer:
    • Cache access pattern analyzer
    • Optimal order calculator (the innermost loop should access memory contiguously)
    • Dependence graph checker
    • Transformation validator
  5. Loop Tiling System:
    • Tile size calculator (based on cache size)
    • Tiling code generator
    • Multi-level tiling support (L1, L2, L3 cache)
    • Boundary condition handler
  6. Loop Distribution:
    • Loop fission transformer (the opposite of fusion, used for parallelization)
    • Dependence breaker
    • Load balancing optimizer

  7. Testing Framework

  8. Unit tests: Each transformation in isolation
  9. Correctness tests: Transformed output == original output
  10. Performance tests: Unrolled >10% faster, fused >20% faster
  11. Cache simulation tests: Tiling reduces cache misses by >50%
  12. Edge cases: Loops with non-divisible trip counts
  13. Nested loops: Interchange + tiling combined
  14. Dependency tests: Illegal transformations rejected
  15. Regression suite: Audio samples processed correctly
  16. Test coverage >90%

  17. Documentation

  18. Loop transformation theory (when to apply what)
  19. Unrolling guidelines (optimal factor per architecture)
  20. Fusion opportunities (common fusable patterns)
  21. Tiling strategies (tile size selection)
  22. Cache optimization guide
  23. Examples of transformations on real code

  24. Interfaces and Connections

  25. LoopOptimizer.unroll(loop, factor) → Unrolled code
  26. LoopOptimizer.fuse(loop1, loop2) → Fused loop
  27. LoopOptimizer.interchange(loop_nest, order) → Reordered nest
  28. LoopOptimizer.tile(loop, tile_size) → Tiled loop
  29. LoopOptimizer.analyze_dependencies(loop) → DependencyGraph
  30. Integration with vectorization_engine
  31. Symlink: topology_definitions/ → ../05_TOPOLOGY_DESIGN/
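
A minimal hand-written illustration of the loop fusion transformation described above, assuming two independent passes over the same buffer; the function names are hypothetical, but the before/after shape is the pattern the fusion system would look for.

```cpp
#include <cstddef>

// Before: two loops, the buffer is streamed through memory twice.
void gain_then_offset(float* x, std::size_t n, float g, float dc) {
    for (std::size_t i = 0; i < n; ++i) x[i] *= g;
    for (std::size_t i = 0; i < n; ++i) x[i] += dc;
}

// After fusion: one loop, one pass through memory, identical result.
void gain_then_offset_fused(float* x, std::size_t n, float g, float dc) {
    for (std::size_t i = 0; i < n; ++i) x[i] = x[i] * g + dc;
}
```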

DELIVERABLES:
- [ ] Loop unrolling with automatic factor selection
- [ ] Loop fusion to reduce memory traffic
- [ ] Loop interchange for cache optimization
- [ ] Loop tiling for working set management
- [ ] Tests >90% coverage
- [ ] Documentation with real examples
- [ ] Benchmarks showing measurable improvements

ESTIMATE: 2.5 weeks


TASK 3: Function Optimization - The Overhead Eliminator

Folder: 05_06_02_function_optimization
Criticality: ⭐⭐⭐⭐ (High)
Priority: TIER 3

DEVELOPMENT:

  1. Core Implementation
  2. Inlining Engine:
    • Call graph builder (analyze what calls what)
    • Inlining heuristic (function size, call frequency, call chain depth)
    • Code size estimator (avoid excessive code bloat)
    • Recursive inlining controller (depth limit)
    • Force-inline directives support (see the sketch below)
  3. Devirtualization System:
    • Type propagation analyzer (determine concrete types)
    • Virtual call resolver (convert virtual → direct where possible)
    • Speculative devirtualization (optimistic dispatch with fallback)
  4. Function Fusion:
    • Composition pattern detector (f(g(x)) patterns)
    • Fusion transformer (combine small functions)
    • Parameter optimization (eliminate redundant arguments)
  5. Tail Call Optimization:
    • Tail recursion detector
    • Tail call eliminator (convert to a loop)
  6. Constant Propagation:
    • Inter-procedural constant tracking
    • Argument specialization (create versions with constant args)

  7. Testing Framework

  8. Inlining tests: Verify that small functions get inlined
  9. Call overhead tests: Measure overhead reduction (expected 10-20 cycles/call)
  10. Code size tests: Verify that inlining does not blow up code size
  11. Devirtualization tests: Virtual calls converted to direct calls
  12. Correctness tests: Function fusion preserves semantics
  13. Tail recursion tests: Recursion converted to a loop
  14. Performance tests: Code with inlined functions >15% faster
  15. Regression suite: Audio processing correctness
  16. Test coverage >90%

  17. Documentation

  18. Inlining best practices (when to inline, when not to)
  19. Call overhead analysis (the real cost of function calls)
  20. Devirtualization opportunities (virtual functions in DSP)
  21. Function fusion patterns (common compositions)
  22. Code size vs performance trade-offs
  23. Compiler directives reference (__attribute__((always_inline)), etc.)

  24. Interfaces and Connections

  25. FunctionOptimizer.inline_function(call_site) → Inlined code
  26. FunctionOptimizer.devirtualize(virtual_call) → Direct call
  27. FunctionOptimizer.fuse_functions(f, g) → Fused function
  28. FunctionOptimizer.analyze_call_graph() → CallGraph
  29. FunctionOptimizer.estimate_code_size(function) → Size estimate
  30. Integration with loop_optimization (inlining exposes more loop optimization)
  31. Symlink: kernel_implementations/ → ../04_KERNELS_L0/
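
A small sketch of the force-inline case the engine automates: a per-sample helper marked for inlining so the call overhead disappears and the surrounding loop becomes a straight-line body. The attribute syntax is GCC/Clang-specific and the helper itself is a made-up example.

```cpp
#include <cstddef>

// Tiny per-sample helper; small enough that forced inlining is almost always a win.
[[gnu::always_inline]] inline float soft_clip(float x) {
    return x - (x * x * x) * (1.0f / 3.0f);   // cheap cubic soft clipper
}

void process_block(float* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        buf[i] = soft_clip(buf[i]);           // after inlining: a straight-line, vectorizable loop body
}
```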

DELIVERABLES:
- [ ] Inlining engine with configurable heuristics
- [ ] Devirtualization to eliminate indirect calls
- [ ] Function fusion for common compositions
- [ ] Call graph analyzer
- [ ] Tests >90% coverage
- [ ] Strategy documentation
- [ ] Benchmarks showing overhead reduction

ESTIMATE: 2 weeks


TASK 4: Memory Optimization - The Data Organizer

Folder: 05_06_03_memory_optimization
Criticality: ⭐⭐⭐⭐⭐ (Maximum - memory is the main bottleneck)
Priority: TIER 2

DEVELOPMENT:

  1. Core Implementation
  2. Cache Analyzer:
    • Cache miss simulator (estimate misses without executing)
    • Working set calculator (how much live memory the code uses)
    • Cache line utilization tracker (what % of each line is actually used)
    • Temporal/spatial locality analyzer
  3. Data Layout Optimizer:
    • Structure packing (remove unnecessary padding)
    • Field reordering (hot fields together)
    • AoS ↔ SoA transformer (introduced under vectorization, refined here; see the sketch below)
    • Alignment enforcer (16/32/64-byte alignment)
  4. Buffer Management:
    • Lifetime analyzer (when buffers are live)
    • Memory pool allocator (reuse buffers)
    • In-place operation detector (avoid copies)
    • Scratch buffer optimizer (temporary storage allocation)
  5. Prefetching System:
    • Access pattern predictor
    • Software prefetch inserter (__builtin_prefetch)
    • Prefetch distance tuner
  6. Memory Bandwidth Optimizer:
    • Bandwidth usage tracker
    • Copy elimination (remove unnecessary memcpy calls)
    • Stream optimization (non-temporal stores)

  7. Testing Framework

  8. Cache simulation tests: Verify a >50% reduction in misses
  9. Alignment tests: All SIMD buffers correctly aligned
  10. Memory layout tests: SoA vs AoS performance differential
  11. Buffer reuse tests: Verify that pools work without leaks
  12. Prefetching tests: Measure prefetch impact (expected 10-30% improvement)
  13. Bandwidth tests: Verify usage does not exceed peak bandwidth
  14. In-place tests: Verify that copies are eliminated where possible
  15. Regression suite: Correctness across different layouts
  16. Test coverage >90%

  17. Documentation

  18. Cache hierarchy guide (L1/L2/L3 characteristics)
  19. Memory layout best practices (when to use SoA vs AoS)
  20. Alignment requirements (SSE/AVX/NEON)
  21. Buffer management strategies (pools vs malloc)
  22. Prefetching guide (when it pays off)
  23. Memory bandwidth optimization (avoiding thrashing)
  24. Debugging memory issues (cache miss profiling)

  25. Interfaces and Connections

  26. MemoryOptimizer.analyze_cache_behavior(code) → CacheReport
  27. MemoryOptimizer.optimize_layout(struct) → Optimized layout
  28. MemoryOptimizer.allocate_buffer_pool(size) → BufferPool
  29. MemoryOptimizer.insert_prefetch(access_pattern) → Prefetched code
  30. MemoryOptimizer.eliminate_copies(topology) → Copy-free topology
  31. Integration with vectorization_engine (alignment is critical)
  32. Symlink: topology_definitions/ → ../05_TOPOLOGY_DESIGN/
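
A minimal sketch of the AoS → SoA transformation for stereo audio, assuming an interleaved frame struct; the types are illustrative. The SoA layout gives each channel a contiguous, SIMD-friendly array instead of strided access.

```cpp
#include <cstddef>
#include <vector>

struct FrameAoS { float left; float right; };       // interleaved: L R L R ...

struct BufferSoA {                                  // planar: all L samples, then all R samples
    std::vector<float> left;
    std::vector<float> right;
};

BufferSoA deinterleave(const std::vector<FrameAoS>& aos) {
    BufferSoA soa;
    soa.left.reserve(aos.size());
    soa.right.reserve(aos.size());
    for (const FrameAoS& f : aos) {                 // one pass, contiguous writes per channel
        soa.left.push_back(f.left);
        soa.right.push_back(f.right);
    }
    return soa;
}
```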

DELIVERABLES:
- [ ] Cache analyzer with simulation
- [ ] Data layout optimizer (packing, reordering, AoS↔SoA)
- [ ] Buffer pool allocator
- [ ] Software prefetching inserter
- [ ] Copy elimination optimizer
- [ ] Tests >90% coverage
- [ ] Memory optimization documentation
- [ ] Benchmarks showing >95% cache hit rate

ESTIMATE: 2.5 weeks


TASK 5: Expression Optimization - The Algebraic Simplifier

Folder: 05_06_04_expression_optimization
Criticality: ⭐⭐⭐ (Medium)
Priority: TIER 2

DEVELOPMENT:

  1. Core Implementation
  2. Common Subexpression Eliminator:
    • Expression tree builder (expression AST)
    • Subexpression hasher (identify identical expressions)
    • Value numbering (assign unique IDs to expressions)
    • CSE transformer (replace duplicates with a temp variable; see the sketch below)
  3. Constant Folder:
    • Compile-time expression evaluator
    • Constant propagator (propagate known values)
    • Dead code eliminator (remove computations with no effect)
  4. Algebraic Simplifier:
    • Identity rules (x*1→x, x+0→x, x*0→0)
    • Associativity optimizer (reorder to expose constants)
    • Distributivity transformer (factor out common terms)
    • Trigonometric simplifier (sin²x+cos²x→1, etc.)
  5. Strength Reducer:
    • Expensive operation detector (div, pow, sqrt)
    • Cheap replacement finder (x/2→x*0.5, x*2→x<<1)
    • Platform-aware replacements (exploit FMA, rsqrt)
  6. Reassociation Optimizer:
    • Reordering to expose parallelism (a+b+c+d → (a+b)+(c+d))
    • Reduction tree builder to minimize depth

  7. Testing Framework

  8. CSE tests: Verify that duplicated expressions are eliminated
  9. Constant folding tests: Verify correct pre-computation
  10. Algebraic simplification tests: Verify identities are applied
  11. Strength reduction tests: Verify correct substitutions
  12. Numerical accuracy tests: Verify that simplifications do not introduce error
  13. Performance tests: Simplified code >10% faster
  14. Correctness tests: Mathematically equivalent output
  15. Regression suite: Audio samples processed correctly
  16. Test coverage >90%

  17. Documentation

  18. Expression optimization theory
  19. Algebraic identities reference
  20. Strength reduction catalog (expensive → cheap operations)
  21. Numerical precision considerations
  22. Examples of applied optimizations
  23. Trade-offs (simplification vs precision)

  24. Interfaces and Connections

  25. ExpressionOptimizer.eliminate_cse(expression_tree) → Optimized tree
  26. ExpressionOptimizer.fold_constants(expression) → Simplified expr
  27. ExpressionOptimizer.simplify_algebraic(expression) → Simplified expr
  28. ExpressionOptimizer.reduce_strength(operation) → Cheaper operation
  29. ExpressionOptimizer.reassociate(expression) → Reordered expr
  30. Integration with function_optimization
  31. Symlink: algorithm_specs/ → ../03_ALGORITHM_SPEC/
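
A hand-written illustration of the two transformations named above applied together: the repeated subexpression is computed once (CSE) and the per-sample division is replaced by multiplication with a hoisted reciprocal (strength reduction). The functions are made-up examples; the last-bit differences this introduces are exactly why the plan pairs these passes with tolerance-based regression tests.

```cpp
#include <cstddef>

// Before: (a[i] + b[i]) is evaluated twice and the division repeats every sample.
void mix_naive(const float* a, const float* b, float* out, std::size_t n, float norm) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = (a[i] + b[i]) / norm + (a[i] + b[i]) * 0.5f;
}

// After: one temporary for the common subexpression, one reciprocal hoisted out of the loop.
void mix_optimized(const float* a, const float* b, float* out, std::size_t n, float norm) {
    const float inv_norm = 1.0f / norm;
    for (std::size_t i = 0; i < n; ++i) {
        const float s = a[i] + b[i];
        out[i] = s * inv_norm + s * 0.5f;
    }
}
```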

DELIVERABLES:
- [ ] Common subexpression eliminator
- [ ] Constant folder
- [ ] Algebraic simplifier with a complete rule set
- [ ] Strength reducer
- [ ] Tests >90% coverage
- [ ] Transformation documentation
- [ ] Benchmarks showing a reduction in operation count

ESTIMATE: 2 weeks


TASK 6: Branch Optimization - The Jump Eliminator

Folder: 05_06_05_branch_optimization
Criticality: ⭐⭐⭐ (Medium)
Priority: TIER 3

DEVELOPMENT:

  1. Core Implementation
  2. Branch Eliminator:
    • Conditional move generator (CMOV instructions)
    • Branchless code transformer (using masking; see the sketch below)
    • Predication support (ARM conditional execution)
    • Select operator generator (ternary → SIMD select)
  3. Branch Prediction Hinter:
    • Likely/unlikely annotation inserter (__builtin_expect)
    • Profile data analyzer (determine hot/cold branches)
    • Branch layout optimizer (keep the hot path linear)
  4. Loop Unswitching:
    • Invariant condition detector (ifs that do not depend on the loop variable)
    • Unswitching transformer (move the if outside the loop)
    • Code duplication controller (avoid explosion)
  5. Switch Optimization:
    • Jump table generator (dense switches)
    • Binary search tree generator (sparse switches)
    • Hash table generator (very sparse switches)
  6. Early Exit Optimizer:
    • Guard condition hoister
    • Short-circuit evaluator

  7. Testing Framework

  8. Branchless tests: Verify branch elimination
  9. CMOV tests: Verify use of conditional moves
  10. Prediction tests: Verify correct hints
  11. Unswitching tests: Verify invariants moved outside the loop
  12. Switch tests: Verify jump table vs tree selection
  13. Performance tests: Branchless code >15% faster (when the branch is unpredictable)
  14. Correctness tests: Transformations preserve semantics
  15. Regression suite: Audio processed correctly
  16. Test coverage >90%

  17. Documentation

  18. Branch cost analysis (misprediction penalty)
  19. Branchless programming guide
  20. Prediction hints usage
  21. Loop unswitching opportunities
  22. Switch optimization strategies
  23. When to avoid branchless code (predictable branches are better left alone)

  24. Interfaces and Connections

  25. BranchOptimizer.eliminate_branch(if_statement) → Branchless code
  26. BranchOptimizer.insert_prediction_hint(branch, likely) → Hinted code
  27. BranchOptimizer.unswitch_loop(loop) → Unswitched loop
  28. BranchOptimizer.optimize_switch(switch) → Optimal dispatch
  29. Integration with loop_optimization
  30. Symlink: profile_data/ → ../18_QUALITY_METRICS/profiling/
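
A small sketch combining two of the techniques above: an unpredictable per-sample condition rewritten as a select, and a rare error path annotated with __builtin_expect (GCC/Clang) so the hot path stays linear. The functions themselves are illustrative.

```cpp
#include <cstddef>

// The ternary on floats typically lowers to a conditional move / SIMD blend, not a jump.
void clamp_negative_to_zero(float* x, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        x[i] = (x[i] < 0.0f) ? 0.0f : x[i];
}

bool process(float* x, std::size_t n) {
    if (__builtin_expect(x == nullptr, 0))    // cold path: the hint keeps it out of the hot layout
        return false;
    clamp_negative_to_zero(x, n);
    return true;
}
```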

DELIVERABLES:
- [ ] Branch eliminator (branchless code generation)
- [ ] Prediction hint inserter
- [ ] Loop unswitching transformer
- [ ] Switch optimizer
- [ ] Tests >90% coverage
- [ ] Strategy documentation
- [ ] Benchmarks showing fewer mispredictions

ESTIMATE: 1.5 weeks


TASK 7: Parallel Execution - The Work Distributor

Folder: 05_06_06_parallel_execution
Criticality: ⭐⭐⭐⭐ (High)
Priority: TIER 4

DEVELOPMENT:

  1. Core Implementation
  2. Topology Parallelizer:
    • Dependency graph analyzer (from 05_TOPOLOGY_DESIGN)
    • Independent node detector (nodes with no dependencies between them)
    • Stage decomposer (split the topology into parallel stages)
    • Load balancer (distribute work evenly)
  3. Thread Pool Manager:
    • Worker thread pool (pre-create N threads)
    • Work queue (lock-free MPMC queue)
    • Thread affinity manager (pin threads to cores)
    • Priority scheduler (critical tasks first)
  4. Data Parallelism:
    • Voice parallelizer (process independent voices in parallel; see the sketch below)
    • Channel parallelizer (stereo/multi-channel in parallel)
    • Batch parallelizer (process multiple buffers simultaneously)
  5. Synchronization Primitives:
    • Barrier implementation (threads wait at a sync point)
    • Atomic operations wrapper (lock-free communication)
    • Condition variable manager
    • Spinlock vs mutex selector (based on contention)
  6. NUMA Awareness:
    • NUMA topology detector
    • Memory affinity manager (allocate near the core that processes the data)
    • Inter-socket communication minimizer

  7. Testing Framework

  8. Parallelization tests: Verify correctness under threading
  9. Scaling tests: N cores → 0.8N speedup (verify efficiency >80%)
  10. Race condition tests: ThreadSanitizer, stress testing
  11. Deadlock tests: Verify there are no deadlocks
  12. Load balance tests: Verify an even distribution of work
  13. NUMA tests: Verify that memory is allocated near the cores using it
  14. Throughput tests: Measure samples/sec with multi-threading
  15. Regression suite: Audio processed identically to single-threaded
  16. Test coverage >90%

  17. Documentation

  18. Parallelization theory (Amdahl's law, Gustafson's law)
  19. Thread pool design
  20. Lock-free programming guide
  21. NUMA optimization
  22. Debugging parallel code
  23. Performance tuning (thread count selection)

  24. Interfaces and Connections

  25. ParallelExecutor.parallelize_topology(topology) → Parallel schedule
  26. ParallelExecutor.create_thread_pool(num_threads) → ThreadPool
  27. ParallelExecutor.submit_task(task) → Future
  28. ParallelExecutor.wait_all(futures) → Results
  29. Integration with topology_design (dependency graph)
  30. Symlink: topology_definitions/ → ../05_TOPOLOGY_DESIGN/
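
A minimal sketch of the voice-level data parallelism described above: each worker thread renders its own slice of the voice array into a private buffer, followed by a single-threaded mixdown. The Voice type and render body are toy placeholders, and a real implementation would reuse a persistent thread pool instead of spawning threads per block.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Voice {
    float level = 0.1f;
    void render(float* out, std::size_t n) const {          // toy render: accumulate a constant level
        for (std::size_t i = 0; i < n; ++i) out[i] += level;
    }
};

void render_voices_parallel(const std::vector<Voice>& voices, float* mix,
                            std::size_t n, unsigned num_threads) {
    num_threads = std::max(1u, num_threads);
    std::vector<std::vector<float>> partial(num_threads, std::vector<float>(n, 0.0f));
    std::vector<std::thread> workers;
    const std::size_t per_thread = (voices.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * per_thread;
            const std::size_t end = std::min(voices.size(), begin + per_thread);
            for (std::size_t v = begin; v < end; ++v)
                voices[v].render(partial[t].data(), n);      // each worker writes only its own buffer
        });
    }
    for (auto& w : workers) w.join();
    for (const auto& buf : partial)                          // single-threaded mixdown of partial buffers
        for (std::size_t i = 0; i < n; ++i) mix[i] += buf[i];
}
```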

DELIVERABLES:
- [ ] Topology parallelizer with dependency analysis
- [ ] Thread pool manager
- [ ] Lock-free work queue
- [ ] NUMA-aware allocator
- [ ] Tests >90% coverage (including ThreadSanitizer)
- [ ] Parallel programming documentation
- [ ] Benchmarks showing 80%+ multi-core efficiency

ESTIMATE: 3 weeks


TASK 8: GPU Acceleration - The Massively Parallel Processor

Folder: 05_06_07_gpu_acceleration
Criticality: ⭐⭐⭐ (Medium - not all workloads benefit)
Priority: TIER 4

DEVELOPMENT:

  1. Core Implementation
  2. Workload Classifier:
    • Parallelizability scorer (how much parallelism is available)
    • Latency sensitivity detector (the GPU path adds latency)
    • Data transfer cost estimator (CPU↔GPU bandwidth is limited)
    • GPU suitability predictor (recommend GPU or not; see the sketch below)
  3. CUDA/OpenCL Code Generator:
    • Kernel generator from the DSP topology
    • Memory management (host ↔ device transfers)
    • Grid/block size optimizer
    • Shared memory allocator
  4. Transfer Optimizer:
    • Pinned memory allocator (faster transfers)
    • Asynchronous transfer scheduler (overlap compute and transfer)
    • Batch aggregator (transfer large batches, not individual samples)
  5. Multi-GPU Manager:
    • GPU topology detector
    • Work distribution across GPUs
    • Peer-to-peer transfer support
  6. Fallback Manager:
    • CPU fallback when no GPU is available
    • Graceful degradation

  7. Testing Framework

  8. GPU kernel tests: Verify correctness vs the CPU reference
  9. Performance tests: Batch processing >50x speedup (for suitable workloads)
  10. Transfer overhead tests: Measure CPU↔GPU cost
  11. Multi-GPU tests: Verify scaling with 2+ GPUs
  12. Fallback tests: Verify that the CPU fallback works
  13. Memory tests: Verify there are no leaks
  14. Latency tests: Measure end-to-end latency (GPU paths tend to be high-latency)
  15. Regression suite: Mathematical correctness
  16. Test coverage >85% (GPU testing is more complex)

  17. Documentation

  18. GPU acceleration theory (when to use the GPU)
  19. CUDA/OpenCL programming guide
  20. Data transfer optimization
  21. Memory hierarchy (global, shared, registers)
  22. Multi-GPU programming
  23. Debugging GPU code
  24. When NOT to use the GPU (latency-critical paths, small workloads)

  25. Interfaces and Connections

  26. GPUAccelerator.should_offload(workload) → bool + justification
  27. GPUAccelerator.generate_kernel(topology) → GPU kernel
  28. GPUAccelerator.execute_on_gpu(data, kernel) → Results
  29. GPUAccelerator.optimize_transfers(batch) → Transfer plan
  30. Integration with parallel_execution
  31. Symlink: topology_definitions/ → ../05_TOPOLOGY_DESIGN/
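
A host-side sketch of the should_offload() decision under stated assumptions: offload only when the estimated GPU time plus transfer cost clearly beats the CPU, and never for latency-critical blocks. The struct fields, constants, and the 2x margin are placeholders the auto-tuner and benchmark data would supply in practice.

```cpp
#include <cstddef>

struct Workload {
    std::size_t bytes_in;        // host -> device
    std::size_t bytes_out;       // device -> host
    double      cpu_time_ms;     // measured or estimated CPU cost
    double      gpu_time_ms;     // estimated kernel time, excluding transfers
    bool        latency_critical;
};

bool should_offload(const Workload& w) {
    constexpr double kPcieGBps = 12.0;                          // assumed effective transfer bandwidth
    const double transfer_ms =
        (w.bytes_in + w.bytes_out) / (kPcieGBps * 1e6);         // bytes / (GB/s) expressed in ms
    if (w.latency_critical) return false;                       // the GPU path adds latency
    return (w.gpu_time_ms + transfer_ms) < 0.5 * w.cpu_time_ms; // require a clear (2x) margin
}
```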

DELIVERABLES:
- [ ] Workload classifier (identify good candidates)
- [ ] CUDA/OpenCL code generator
- [ ] Transfer optimizer
- [ ] Multi-GPU support
- [ ] CPU fallback
- [ ] Tests >85% coverage
- [ ] GPU acceleration documentation
- [ ] Benchmarks showing 50-100x speedup (batch processing)

ESTIMATE: 4 weeks


TASK 9: Platform-Specific Optimization - The Hardware Adapter

Folder: 05_06_08_platform_specific
Criticality: ⭐⭐⭐⭐ (High)
Priority: TIER 1 (required before the other optimizations)

DEVELOPMENT:

  1. Core Implementation
  2. CPU Feature Detector:
    • CPUID wrapper (x86 feature detection)
    • Runtime ISA detection (SSE, SSE2, SSE3, SSSE3, SSE4, AVX, AVX2, AVX-512, FMA)
    • ARM capabilities (NEON, Crypto extensions)
    • Cache size detector (L1/L2/L3)
    • Core count detector
  3. Platform Abstraction Layer:
    • Unified API for platform-specific operations
    • Backend selector (runtime dispatch to the optimal implementation; see the sketch below)
    • Feature flag system
  4. Instruction Selection:
    • FMA code path (fused multiply-add)
    • Fast reciprocal sqrt (rsqrt approximation + Newton-Raphson)
    • Fast division (reciprocal + multiply)
    • Platform-specific intrinsics wrapper
  5. Micro-architecture Tuning:
    • CPU model detector (Intel Skylake vs AMD Zen vs ARM Cortex)
    • Latency/throughput tables per CPU
    • Scheduling optimizer tuned to each micro-architecture
  6. Auto-tuning Infrastructure:
    • Benchmark runner (try variants at runtime)
    • Configuration cache (store the best configuration per hardware)
    • Offline tuning support (pre-tune on a build farm)

  7. Testing Framework

  8. Feature detection tests: Verify correct detection
  9. Dispatch tests: Verify the optimal backend is selected
  10. Correctness tests: All variants produce the same output
  11. Performance tests: Platform-specific >20% faster than generic
  12. Portability tests: Code works on 5+ architectures
  13. Auto-tuning tests: Finds the optimal configuration
  14. Regression suite: Correctness on all platforms
  15. Test coverage >90%

  16. Documentation

  17. Platform support matrix (which features on which CPUs)
  18. Feature detection guide
  19. Writing platform-specific code
  20. Auto-tuning guide
  21. Porting guide (adding a new platform)
  22. Performance comparison (generic vs SSE vs AVX vs NEON)

  23. Interfaces and Connections

  24. PlatformDetector.get_capabilities() → CapabilitySet
  25. PlatformOptimizer.select_backend(operation) → Backend
  26. PlatformOptimizer.tune_for_hardware() → TuningConfig
  27. PlatformOptimizer.benchmark_variants(variants) → BenchmarkResults
  28. Integration with vectorization_engine
  29. Foundation for all the other optimizations
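
A minimal x86 runtime-dispatch sketch using GCC/Clang's __builtin_cpu_supports. The three kernel variants are hypothetical stand-ins for backends generated elsewhere in this layer, and an ARM/NEON build would go through its own detection path.

```cpp
#include <cstddef>

// Backends assumed to be generated/compiled elsewhere in the optimization layer.
void apply_gain_avx2(float* buf, std::size_t n, float g);    // assumed AVX2 variant
void apply_gain_sse2(float* buf, std::size_t n, float g);    // assumed SSE2 variant
void apply_gain_scalar(float* buf, std::size_t n, float g);  // portable fallback

using GainFn = void (*)(float*, std::size_t, float);

// Pick the best available backend once at startup; __builtin_cpu_supports is GCC/Clang-specific.
GainFn select_gain_backend() {
    if (__builtin_cpu_supports("avx2")) return apply_gain_avx2;
    if (__builtin_cpu_supports("sse2")) return apply_gain_sse2;
    return apply_gain_scalar;                                // always-correct fallback
}
```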

DELIVERABLES:
- [ ] CPU feature detector (x86, ARM)
- [ ] Platform abstraction layer
- [ ] Runtime dispatch system
- [ ] Auto-tuning infrastructure
- [ ] Tests >90% coverage on multiple platforms
- [ ] Platform support documentation
- [ ] Performance comparison report

ESTIMATE: 2 weeks


TASK 10: Profile-Guided Optimization - The Intelligent Optimizer

Folder: 05_06_09_profile_guided_optimization
Criticality: ⭐⭐⭐⭐ (High - real data beats guessing)
Priority: TIER 5 (needs everything else first)

DEVELOPMENT:

  1. Core Implementation
  2. Instrumentation System:
    • Code instrumenter (insert counters with little overhead; see the sketch below)
    • Branch coverage tracker (which branches are taken most often)
    • Function call counter (which functions are hot)
    • Cache miss tracker (lightweight sampling)
    • Execution time profiler
  3. Profile Data Collector:
    • Profile database (store data from multiple runs)
    • Data aggregator (combine profiles from different workloads)
    • Profile viewer/analyzer
  4. Optimization Selector:
    • Hot path identifier (code consuming 80% of the time)
    • Cold code detector (code that almost never executes)
    • Branch probability estimator (for prediction hints)
    • Inline decision maker (inline hot functions)
  5. Feedback Loop:
    • Re-compilation trigger (when profile data changes)
    • A/B testing framework (compare optimizations)
    • Regression detector (optimizations that hurt performance)
  6. Continuous Profiling:
    • Lightweight always-on profiling (overhead <1%)
    • Periodic re-optimization
    • Adaptive optimization (adjust to the user's real workload)

  7. Testing Framework

  8. Instrumentation tests: Verify overhead <5%
  9. Profile accuracy tests: Verify the data is correct
  10. Optimization tests: PGO code >20% faster than non-PGO
  11. A/B testing: Verify that the better variant is selected
  12. Continuous profiling tests: Overhead <1% verified
  13. Regression detection tests: Degradations detected
  14. Correctness tests: Optimizations preserve correctness
  15. Test coverage >90%

  16. Documentation

  17. PGO theory (why it works)
  18. Instrumentation guide
  19. Creating representative workloads
  20. Profile data interpretation
  21. Continuous profiling setup
  22. A/B testing methodology
  23. Case studies (real-world improvements)

  24. Interfaces and Connections

  25. PGO.instrument_code(code) → Instrumented code
  26. PGO.collect_profile(run) → ProfileData
  27. PGO.optimize_with_profile(code, profile) → Optimized code
  28. PGO.compare_variants(variant_a, variant_b) → WinnerVariant
  29. PGO.enable_continuous_profiling() → Profiling session
  30. Integration with all the other optimizers (they consume the profile data)
  31. Symlink: performance_metrics/ → ../18_QUALITY_METRICS/
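
A sketch of the lightweight instrumentation idea: a per-site atomic counter with relaxed ordering, bumped once per block rather than per sample. The names and the reporting side are illustrative; production PGO would typically rely on compiler tooling (e.g. -fprofile-generate / -fprofile-use) or a sampling profiler.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct HitCounter {
    const char* site;                                    // label for the instrumented hot spot
    std::atomic<std::uint64_t> hits{0};
    void bump() { hits.fetch_add(1, std::memory_order_relaxed); }
};

static HitCounter g_biquad_counter{"biquad_process"};

void biquad_process_instrumented(float* x, std::size_t n) {
    g_biquad_counter.bump();                             // one relaxed increment per block, not per sample
    for (std::size_t i = 0; i < n; ++i) x[i] *= 0.5f;    // placeholder DSP body
}

void dump_profile() {
    std::printf("%s: %llu calls\n", g_biquad_counter.site,
                static_cast<unsigned long long>(g_biquad_counter.hits.load()));
}
```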

DELIVERABLES:
- [ ] Code instrumentation system
- [ ] Profile data collector and analyzer
- [ ] Profile-guided optimization pipeline
- [ ] A/B testing framework
- [ ] Continuous profiling system
- [ ] Tests >90% coverage
- [ ] Complete documentation
- [ ] Case studies showing >20% improvement

ESTIMATE: 3 weeks


TASK 11: Regression Testing - The Correctness Guardian

Folder: 05_06_10_regression_testing
Criticality: ⭐⭐⭐⭐⭐ (Maximum - correctness is sacrosanct)
Priority: TIER 1 (first subsystem to implement)

DEVELOPMENT:

  1. Core Implementation
  2. Reference Implementation Manager:
    • Non-optimized reference code builder (compiled without optimizations)
    • Bit-exact reference for integer operations
    • High-precision reference for float operations (float64 or arbitrary precision)
  3. Comparison Engine:
    • Sample-by-sample comparator (see the sketch below)
    • Tolerance calculator (based on the type of operation)
    • Difference metrics (max error, RMS error, SNR, THD+N)
    • Visual diff generator (waveform comparison)
  4. Test Suite Manager:
    • Synthetic signal generator (sines, impulses, noise, sweeps)
    • Real audio loader (music, voice, various genres)
    • Edge case generator (silence, DC, Nyquist, clipping, NaN, inf)
    • Stress test generator (extreme values, long durations)
  5. Automation System:
    • CI/CD integration (run tests on every commit)
    • Test result database (result history)
    • Regression detector (alert when tests fail)
    • Bisect support (find the commit that introduced a regression)
  6. Performance Regression Detection:
    • Benchmark database (performance history)
    • Statistical analyzer (detect significant degradations)
    • Alert system (notify on a performance drop >10%)
  7. Golden File System:
    • Expected outputs storage
    • Golden file updater (when an optimization legitimately changes the output)
    • Diff viewer

  8. Testing Framework

  9. Meta-tests: Tests of the tests (test framework correctness)
  10. Tolerance tests: Verify the tolerances are appropriate
  11. Coverage tests: 100% of optimized kernels have tests
  12. Performance tests: The test suite runs in <5 minutes
  13. Correctness validation: Reference implementation vs optimized
  14. False positive tests: Tests must not fail on correct code
  15. Test coverage: >95% (critical for regression testing)

  16. Documentation

  17. Testing methodology
  18. Tolerance guidelines (when to use which tolerance)
  19. Creating test cases
  20. Interpreting test failures
  21. Debugging regression failures
  22. Golden file management
  23. CI/CD integration guide

  24. Interfaces and Connections

  25. RegressionTester.compare(reference, optimized, tolerance) → TestResult
  26. RegressionTester.run_suite(test_suite) → SuiteResults
  27. RegressionTester.detect_regression(commit) → RegressionReport
  28. RegressionTester.bisect_failure(start, end) → Guilty commit
  29. RegressionTester.update_golden_files(test_id) → Updated files
  30. Integration with ALL optimization subsystems
  31. Symlink: regression_tests/ → ../30_TESTING_FRAMEWORK/optimization_validation/
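
A minimal sketch of the sample-by-sample comparator: it reports maximum absolute error and RMS error between the reference and optimized outputs and checks them against a caller-supplied tolerance. The metric names follow the plan; the struct layout and function name are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

struct DiffMetrics {
    double max_error = 0.0;
    double rms_error = 0.0;
    bool   passed    = false;
};

DiffMetrics compare_buffers(const float* reference, const float* optimized,
                            std::size_t n, double tolerance) {
    DiffMetrics m;
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double e = std::abs(static_cast<double>(reference[i]) - optimized[i]);
        m.max_error = std::max(m.max_error, e);
        sum_sq += e * e;
    }
    m.rms_error = (n > 0) ? std::sqrt(sum_sq / static_cast<double>(n)) : 0.0;
    m.passed = m.max_error <= tolerance;    // pass/fail on the strictest metric
    return m;
}
```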

DELIVERABLES:
- [ ] Reference implementation system
- [ ] Comparison engine with multiple metrics
- [ ] Comprehensive test suite (synthetic + real audio)
- [ ] CI/CD automation
- [ ] Performance regression detector
- [ ] Golden file system
- [ ] Tests >95% coverage
- [ ] Complete documentation
- [ ] Zero false positives/negatives

ESTIMATE: 2.5 weeks


TASK FINAL-A: Integration Testing & Validation

Folder: 05_06_test_integration
Criticality: ⭐⭐⭐⭐⭐ (Maximum)

DEVELOPMENT:

  1. End-to-End Test Suite
  2. Workflow test: Full optimization pipeline
    • Input: Naive kernel → Vectorize → Loop optimize → Inline → Memory optimize → Branch optimize → Validate correctness → Success
  3. Workflow test: Multi-threaded topology
    • Input: Topology graph → Analyze dependencies → Parallelize → Thread pool execute → Validate output == sequential → Success
  4. Workflow test: GPU offload
    • Input: Convolution workload → Classify (GPU-suitable) → Generate CUDA kernel → Transfer data → Execute → Transfer back → Validate vs CPU → Success
  5. Workflow test: Profile-guided optimization
    • Run: Instrument code → Collect profile → Identify hot paths → Apply targeted optimizations → Benchmark → Verify >20% improvement → Success
  6. Workflow test: Platform-specific dispatch
    • Run: Detect CPU features (AVX2) → Generate AVX2 code → Runtime dispatch → Execute → Validate vs generic → Verify faster → Success

  7. Cross-Subsystem Validation

  8. Vectorization ↔ Memory Optimization: Verify correct alignment
  9. Loop Optimization ↔ Vectorization: Unrolled loops vectorize well
  10. Function Optimization ↔ Expression Optimization: Inlining exposes CSE
  11. Parallel Execution ↔ Memory Optimization: NUMA awareness works
  12. GPU Acceleration ↔ All: GPU kernels are mathematically correct
  13. PGO ↔ All: Profile data guides every optimization
  14. Platform-specific ↔ All: Runtime dispatch works for all subsystems

  15. Regression Test Automation

  16. Test data fixtures: 1000+ test kernels (all DSP types)
  17. Snapshot testing: DB of expected outputs
  18. Golden file testing: Reference audio files
  19. CI/CD matrix: Test on x86_64, ARM64, Windows, Linux, macOS
  20. Nightly builds: Full test suite every night
  21. PR gating: Tests must pass before merge

  22. Performance Validation Suite

  23. Benchmarks: Every optimized kernel vs naive (target: 5-10x)
  24. Vectorization coverage: >80% of critical loops vectorized
  25. Cache hit rate: >95% in L1 cache
  26. Multi-core scaling: 80%+ efficiency
  27. GPU speedup: >50x for batch workloads
  28. Overall system: Real-world audio apps use 50-70% less CPU

  29. Stress & Load Testing

  30. Concurrency stress: 100 simultaneous threads
  31. Memory stress: Process 10GB of audio without leaks
  32. Duration stress: 24hr continuous processing
  33. Platform stress: Test on old CPUs (SSE2-only) and new ones (AVX-512)
  34. Edge case stress: NaN, inf, denormals, extreme gains

DELIVERABLES:
- [ ] 5 end-to-end workflows tested
- [ ] Cross-subsystem validation passing
- [ ] Automated regression suite (CI/CD)
- [ ] Performance targets met (5-10x speedup)
- [ ] Stress tests passing
- [ ] Multi-platform validation

ESTIMATE: 3 weeks


TASK FINAL-B: System Integration

Folder: 05_06_interfaces
Criticality: ⭐⭐⭐⭐ (High)

DEVELOPMENT:

  1. Connectors to external subsystems (per SYMLINKS)
  2. kernel_implementations/ → ../04_KERNELS_L0/
    • Kernel loader (load L0 kernels for optimization)
    • Optimized kernel exporter (export optimized versions)
  3. topology_definitions/ → ../05_TOPOLOGY_DESIGN/
    • Topology importer (read topology graphs)
    • Dependency analyzer integration
    • Optimized topology exporter
  4. algorithm_specs/ → ../03_ALGORITHM_SPEC/
    • Mathematical spec validator (verify that optimizations preserve the math)
  5. performance_metrics/ → ../18_QUALITY_METRICS/
    • Benchmark data importer
    • Profile data importer
    • Performance report exporter
  6. regression_tests/ → ../30_TESTING_FRAMEWORK/optimization_validation/
    • Test suite runner integration
    • Test result reporter
  7. optimized_code/ → ../27_IMPLEMENTATIONS/optimized/
    • Optimized code exporter
    • Build integration
  8. profile_data/ → ../18_QUALITY_METRICS/profiling/
    • Profile data sync
    • Continuous profiling integration

  9. Event Bus Implementation (see the sketch below)

  10. Event types:
    • OptimizationStarted, OptimizationCompleted, OptimizationFailed
    • VectorizationSucceeded(loop_id, speedup), VectorizationFailed(loop_id, reason)
    • RegressionDetected(test_id, expected, actual)
    • PerformanceImproved(kernel_id, old_cycles, new_cycles)
  11. Pub/Sub pattern (in-memory event bus)
  12. Event persistence (audit log for debugging)
  13. Webhook delivery (notify external systems)

  14. Shared State Management

  15. Optimization configuration cache (global settings)
  16. Platform capability cache (detected features)
  17. Profile data cache (hot data in memory)
  18. Benchmark results cache
  19. Thread-safe access to all caches

  20. Communication Protocols

  21. Internal API: Function calls (direct linking)
  22. External API: REST endpoints for monitoring/control
  23. Message format: JSON for configuration, binary for audio data
  24. Compiler integration: Compiler plugins/passes
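
A minimal in-memory pub/sub sketch for the optimization events listed above. Event payloads are reduced to a string and delivery is synchronous for brevity; the class and topic names are illustrative, and a real bus would carry typed events with thread-safe delivery and persistence.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

class EventBus {
public:
    using Handler = std::function<void(const std::string& payload)>;

    void subscribe(const std::string& topic, Handler h) {
        handlers_[topic].push_back(std::move(h));
    }
    void publish(const std::string& topic, const std::string& payload) const {
        auto it = handlers_.find(topic);
        if (it == handlers_.end()) return;
        for (const auto& h : it->second) h(payload);    // synchronous fan-out to subscribers
    }

private:
    std::unordered_map<std::string, std::vector<Handler>> handlers_;
};

// Usage: bus.subscribe("VectorizationFailed", [](const std::string& why){ /* log it */ });
//        bus.publish("VectorizationFailed", "loop_42: non-contiguous access");
```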

DELIVERABLES:
- [ ] 7 symlink connections working
- [ ] Event bus operational
- [ ] Shared caches implemented
- [ ] REST API for monitoring
- [ ] Integration tests passing

ESTIMATE: 2 weeks


TASK FINAL-C: Documentation Package

Folder: 05_06_documentation
Criticality: ⭐⭐⭐⭐ (High)

DEVELOPMENT:

  1. Complete API Reference
  2. C++ API: Doxygen-generated docs
  3. Optimization APIs: Every subsystem documented
  4. Code examples: 100+ snippets
  5. Configuration reference: All flags/knobs

  6. Developer Guide

  7. Getting started: Setting up the optimization pipeline
  8. Architecture deep-dive: How each optimizer works
  9. Writing optimizable code: Best practices
  10. Debugging optimization issues
  11. Performance tuning guide
  12. Platform-specific programming
  13. GPU programming guide

  14. User Manual

  15. Installation guide (dependencies, build)
  16. Configuration guide (optimization levels, targets)
  17. Profiling guide (how to obtain profile data)
  18. Benchmarking guide (measuring improvements)
  19. Troubleshooting FAQ

  20. Theory Documentation

  21. SIMD programming theory
  22. Cache optimization theory
  23. Compiler optimization theory
  24. Parallel programming theory
  25. GPU architecture
  26. Performance analysis methodology

  27. Case Studies

  28. Reverb optimization: 100x speedup with the GPU
  29. Filter bank optimization: 8x speedup with vectorization
  30. Synth voice optimization: 4x more voices with multi-threading
  31. Real-world application: Audio plugin CPU usage reduced by 70%

  32. Architecture Diagrams

  33. System architecture (high-level)
  34. Optimization pipeline flow
  35. Vectorization process
  36. Memory hierarchy
  37. Multi-threading architecture
  38. GPU offload architecture

DELIVERABLES:
- [ ] Complete API reference (Doxygen)
- [ ] Developer guide (150+ pages)
- [ ] User manual (50+ pages)
- [ ] Theory documentation (100+ pages)
- [ ] 5+ case studies with real benchmarks
- [ ] 15+ architecture diagrams
- [ ] Documentation site deployed

ESTIMATE: 3 weeks


ESTIMATE SUMMARY

Phase                               Tasks          Total Duration   Parallelizable
TIER 1 - Foundation                 T11, T9        4.5 weeks        Yes (2 devs) → 2.5 weeks
TIER 2 - Core Optimizations         T1, T4, T5     8 weeks          Yes (3 devs) → 3 weeks
TIER 3 - Advanced Optimizations     T2, T3, T6     6 weeks          Yes (3 devs) → 2.5 weeks
TIER 4 - Parallel Execution         T7, T8         7 weeks          Yes (2 devs) → 4 weeks
TIER 5 - Intelligent Optimization   T10            3 weeks          No → 3 weeks
TIER 6 - Integration                FA, FB, FC     8 weeks          Partial (2 devs) → 5 weeks

TOTAL SEQUENTIAL: ~36.5 weeks
TOTAL PARALLEL (3 devs): ~20 weeks (~5 months)


GLOBAL SUCCESS CRITERIA

Functionality

  • Working vectorization on SSE, AVX, NEON
  • Loop optimizations applied automatically
  • Optimized memory layout (cache hit rate >95%)
  • Multi-threading with 80%+ efficiency
  • GPU acceleration for suitable workloads
  • Complete PGO pipeline

Performance

  • Overall speedup: 5-10x vs naive (Phase 2)
  • Vectorization coverage: >80% of critical loops
  • Multi-core scaling: 80%+ efficiency
  • Cache efficiency: >95% L1 hits
  • GPU speedup: >50x for batch processing
  • Real-world apps: 50-70% less CPU usage

Quality

  • Test coverage >90% (average across all subsystems)
  • 100% of regression tests passing
  • 0 correctness degradations
  • Performance consistency <5% variation
  • Platform portability: 5+ architectures

Usability

  • Optimizations 90% automatic
  • Developer intervention in <10% of cases
  • Compilation time <5 minutes
  • Clear error messages when an optimization fails
  • Comprehensive documentation

# Connection with L0 kernels
🔗 kernel_implementations/ → ../04_KERNELS_L0/

# Connection with topologies
🔗 topology_definitions/ → ../05_TOPOLOGY_DESIGN/

# Connection with mathematical specs
🔗 algorithm_specs/ → ../03_ALGORITHM_SPEC/

# Connection with benchmarks
🔗 performance_metrics/ → ../18_QUALITY_METRICS/

# Connection with tests
🔗 regression_tests/ → ../30_TESTING_FRAMEWORK/optimization_validation/

# Connection with implementations
🔗 optimized_code/ → ../27_IMPLEMENTATIONS/optimized/

# Connection with profiling
🔗 profile_data/ → ../18_QUALITY_METRICS/profiling/

ANTIPATTERNS TO AVOID

🚫 Premature optimization - Optimizing before the code works correctly

🚫 Micro-optimization obsession - Optimizing irrelevant loops while hot paths remain unoptimized

🚫 Optimizing without profiling - Guessing what to optimize instead of measuring

🚫 Breaking correctness for speed - An optimization that produces different output is a bug

🚫 Platform-specific code without a fallback - Code that only works on a specific CPU

🚫 Ignoring the memory hierarchy - Optimizing CPU work while cache thrashing dominates

🚫 Over-threading - More threads than cores (context-switching overhead)

🚫 Optimization without validation - Not verifying that an optimization preserves correctness

🚫 Unexplained magic numbers - Hard-coded values with no documented rationale

🚫 Rigid manual optimization - Not using auto-tuning to adapt to the hardware

🚫 No regression tests - A new optimization silently breaks edge cases

🚫 Ignoring different workloads - Optimizing for the average case, terrible at the extremes


NEXT STEPS

This plan should be executed in TIER order to respect the technical dependencies. Recommended sequence:

  1. Start with TIER 1 (Regression Testing + Platform-Specific) - 2.5 weeks with 2 devs
  2. Continue with TIER 2 (Vectorization + Expression + Memory) - 3 weeks with 3 devs
  3. Implement TIER 3 (Loop + Branch + Function) - 2.5 weeks with 3 devs
  4. Expand to TIER 4 (Parallel + GPU) - 4 weeks with 2 devs
  5. Implement TIER 5 (PGO) - 3 weeks
  6. Finish with TIER 6 (Integration + Testing + Documentation) - 5 weeks with 2 devs

Total optimized timeline: ~20 weeks with a team of 3 specialized developers

Phase 1 (Basic Optimizations): TIER 1 + TIER 2 = ~5.5 weeks → 2-4x speedup
Phase 2 (Advanced Optimizations): TIER 3 + TIER 4 = ~6.5 weeks → 5-10x speedup
Phase 3 (Extreme Optimization): TIER 5 + TIER 6 = ~8 weeks → 50-100x speedup (GPU)