Edge-Veda

A managed on-device AI runtime for Flutter — text, vision, speech, and RAG running sustainably on real phones under real constraints. Private by default.

~22,700 LOC | 50 C API functions | 32 Dart SDK files | 0 cloud dependencies


The Problem

Modern on-device AI demos break instantly in real usage:

  • Thermal throttling collapses throughput
  • Memory spikes cause silent crashes
  • Sessions longer than ~60 seconds become unstable
  • Developers have no visibility into runtime behavior
  • Debugging failures is nearly impossible

Edge-Veda exists to make on-device AI predictable, observable, and sustainable — not just runnable.


What Edge-Veda Is

Edge-Veda is a supervised on-device AI runtime that:

  • Runs text, vision, and speech models fully on device
  • Keeps models alive across long sessions
  • Adapts automatically to thermal, memory, and battery pressure
  • Applies runtime policies instead of crashing
  • Provides structured observability for debugging and analysis
  • Supports structured output, function calling, embeddings, and RAG
  • Is private by default (no network calls during inference)

What Makes Edge-Veda Different

Edge-Veda is designed for behavior over time, not benchmark bursts.

  • A long-lived runtime with persistent workers
  • A system that supervises AI under physical device limits
  • A runtime that degrades gracefully instead of failing
  • An observable, debuggable on-device AI layer
  • A complete on-device AI stack: inference, speech, tools, and retrieval

Core Inference & Speech

  • Persistent text and vision inference workers (models load once, stay in memory)
  • Streaming token generation with pull-based architecture
  • Multi-turn chat session management with auto-summarization at context overflow
  • Chat templates: Llama 3 Instruct, ChatML, Qwen3/Hermes, generic
  • On-device speech recognition via whisper.cpp (Metal GPU accelerated)
  • Real-time streaming transcription in 3-second chunks
  • 48kHz native audio capture with automatic downsampling to 16kHz
  • WhisperWorker isolate for non-blocking transcription
  • ~670ms per chunk on iPhone with Metal GPU (whisper-tiny.en, 77MB)

Structured Output & Function Calling

  • GBNF grammar-constrained generation for structured JSON output
  • Tool/function calling with ToolDefinition, ToolRegistry, and schema validation
  • Multi-round tool chains with configurable max rounds
  • sendWithTools() for automatic tool call/result cycling
  • sendStructured() for grammar-constrained generation
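
Given a ChatSession (see the Quick Start examples below), a grammar-constrained call might look like the sketch here. The GBNF grammar format comes from llama.cpp; the named `grammar:` parameter of `sendStructured()` is an assumption for illustration, not a confirmed signature.

```dart
// Sketch only — the `grammar:` parameter name is assumed.
// GBNF restricts decoding so the model can only emit valid JSON of this shape.
const grammar = r'''
root ::= "{" ws "\"sentiment\"" ws ":" ws sentiment ws "}"
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
ws ::= [ \t\n]*
''';

final result = await session.sendStructured(
  'Classify the sentiment of: "Battery life doubled after the update."',
  grammar: grammar,
);
print(result.text); // e.g. {"sentiment": "positive"}
```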
Embeddings, Confidence & RAG

  • Text embeddings via ev_embed() with L2 normalization
  • Per-token confidence scoring from softmax entropy
  • Cloud handoff signal when average confidence drops below threshold
  • VectorIndex — pure Dart HNSW with cosine similarity and JSON persistence
  • RagPipeline — end-to-end embed, search, inject, generate
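
The per-token confidence scoring listed above is easy to reason about: low softmax entropy means probability mass is concentrated on a few tokens (confident), high entropy means it is spread out. A minimal sketch of entropy-normalized confidence, assuming Edge-Veda's normalization is broadly similar (the exact formula is not documented here):

```dart
import 'dart:math';

// Sketch only — Edge-Veda's exact normalization is an assumption.
// `probs` is the softmax distribution over the vocabulary for one token.
double tokenConfidence(List<double> probs) {
  var entropy = 0.0;
  for (final p in probs) {
    if (p > 0) entropy -= p * log(p);
  }
  final maxEntropy = log(probs.length.toDouble()); // uniform distribution
  return 1.0 - entropy / maxEntropy; // 1.0 = certain, 0.0 = maximally unsure
}
```

Averaging this over the generated tokens gives the signal used for the cloud handoff threshold.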
Runtime Supervision

  • Compute budget contracts — declare p95 latency, battery drain, thermal, and memory ceilings
  • Adaptive budget profiles — auto-calibrate to measured device performance
  • Central scheduler arbitrates concurrent workloads with priority-based degradation
  • Thermal, memory, and battery-aware runtime policy with hysteresis
  • Backpressure-controlled frame processing (drop-newest, not queue-forever)
  • Structured performance tracing (JSONL) with offline analysis tooling
  • Long-session stability validated on-device (12+ minutes, 0 crashes, 0 model reloads)
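
The drop-newest backpressure mentioned above is what keeps latency and memory bounded under load: when the vision worker falls behind, incoming frames are rejected instead of queued. A minimal sketch of the idea (illustrative; not the actual FrameQueue API):

```dart
// Illustrative sketch of drop-newest backpressure; not the real FrameQueue API.
class DropNewestQueue<T> {
  DropNewestQueue(this.capacity);
  final int capacity;
  final _frames = <T>[];
  int dropped = 0;

  /// Returns false (and drops the frame) when the worker is behind.
  bool offer(T frame) {
    if (_frames.length >= capacity) {
      dropped++; // drop the newest arrival rather than queue forever
      return false;
    }
    _frames.add(frame);
    return true;
  }

  T? take() => _frames.isEmpty ? null : _frames.removeAt(0);
}
```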
Device-Aware Model Selection

  • DeviceProfile detects iPhone model, RAM, chip generation, and device tier (low/medium/high/ultra)
  • MemoryEstimator with calibrated bytes-per-parameter formulas for accurate fit prediction
  • ModelAdvisor scores models 0–100 across fit, quality, speed, and context dimensions
  • Use-case weighted recommendations (chat, reasoning, vision, speech, fast)
  • Optimal EdgeVedaConfig generated per model+device pair (context length, threads, memory limit)
  • canRun() for quick fit check before download, checkStorageAvailability() for disk space
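
Put together, model selection might look like the following sketch. Only the class names plus canRun() and checkStorageAvailability() come from the list above; the method shapes (detect(), recommend(), optimalConfig()) are assumptions for illustration.

```dart
// Sketch only — method shapes beyond canRun()/checkStorageAvailability()
// are assumed, not confirmed API.
final profile = await DeviceProfile.detect();
final advisor = ModelAdvisor(profile: profile);

// Use-case weighted 0–100 scoring across fit, quality, speed, and context.
final pick = advisor.recommend(useCase: 'chat');
print('${pick.model}: ${pick.score}/100');

// Quick fit and disk checks before committing to a download.
if (advisor.canRun(pick.model) && await checkStorageAvailability(pick.model)) {
  final config = advisor.optimalConfig(pick.model); // context, threads, memory limit
  await edgeVeda.init(config);
}
```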

Architecture

```
Flutter App (Dart)
    |
    +-- ChatSession ---------- Chat templates, context summarization, tool calling
    +-- WhisperSession ------- Streaming STT with 3s audio chunks
    +-- RagPipeline ---------- Embed → search → inject → generate
    +-- VectorIndex ---------- HNSW-backed vector search with persistence
    |
    +-- EdgeVeda ------------- generate(), generateStream(), embed(), describeImage()
    |
    +-- StreamingWorker ------ Persistent isolate, keeps text model loaded
    +-- VisionWorker --------- Persistent isolate, keeps VLM loaded (~600MB)
    +-- WhisperWorker -------- Persistent isolate, keeps whisper model loaded
    |
    +-- Scheduler ------------ Central budget enforcer, priority-based degradation
    +-- EdgeVedaBudget ------- Declarative constraints (p95, battery, thermal, memory)
    +-- RuntimePolicy -------- Thermal/battery/memory QoS with hysteresis
    +-- TelemetryService ----- iOS thermal, battery, memory polling
    +-- FrameQueue ----------- Drop-newest backpressure for camera frames
    +-- PerfTrace ------------ JSONL flight recorder for offline analysis
    +-- ModelAdvisor --------- Device-aware model recommendations + 4D scoring
    +-- DeviceProfile -------- iPhone model/RAM/chip detection via sysctl
    +-- MemoryEstimator ------ Calibrated model memory prediction
    |
    +-- FFI Bindings --------- 50 C functions via DynamicLibrary.process()
         |
    XCFramework (libedge_veda_full.a)
    +-- engine.cpp ----------- Text inference + embeddings + confidence (wraps llama.cpp)
    +-- vision_engine.cpp ---- Vision inference (wraps libmtmd)
    +-- whisper_engine.cpp --- Speech-to-text (wraps whisper.cpp)
    +-- memory_guard.cpp ----- Cross-platform RSS monitoring, pressure callbacks
    +-- llama.cpp b7952 ------ Metal GPU, ARM NEON, GGUF models (unmodified)
    +-- whisper.cpp v1.8.3 --- Metal GPU, shared ggml backend (unmodified)
```

Key design constraint: Dart FFI is synchronous — calling llama.cpp directly would freeze the UI. All inference runs in background isolates. Native pointers never cross isolate boundaries. Workers maintain persistent contexts so models load once and stay in memory across the entire session.
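
In practice that pattern looks roughly like the skeleton below (illustrative only; the real StreamingWorker is more involved). Only plain Dart values cross the isolate boundary; the native context is created and used entirely inside the worker.

```dart
import 'dart:isolate';

// Illustrative worker skeleton — not the actual StreamingWorker source.
Future<void> workerMain(SendPort toMain) async {
  final commands = ReceivePort();
  toMain.send(commands.sendPort); // hand the main isolate our command channel

  // Load the model once via synchronous FFI, e.g.:
  // final ctx = bindings.ev_init(...); // pointer never leaves this isolate

  await for (final message in commands) {
    final (String prompt, SendPort reply) = message as (String, SendPort);
    // Run the blocking decode loop here, streaming plain strings back:
    for (final token in 'tokens for: $prompt'.split(' ')) {
      reply.send(token);
    }
    reply.send(null); // end-of-stream sentinel
  }
}

Future<void> main() async {
  final fromWorker = ReceivePort();
  await Isolate.spawn(workerMain, fromWorker.sendPort);
  final commands = await fromWorker.first as SendPort;

  final tokens = ReceivePort();
  commands.send(('Explain recursion', tokens.sendPort));
  await for (final token in tokens) {
    if (token == null) break;
    print(token);
  }
}
```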


Quick Start

```yaml
# pubspec.yaml
dependencies:
  edge_veda: ^2.1.0
```

```dart
final edgeVeda = EdgeVeda();

await edgeVeda.init(EdgeVedaConfig(
  modelPath: modelPath,
  contextLength: 2048,
  useGpu: true,
));

// Streaming
await for (final chunk in edgeVeda.generateStream('Explain recursion briefly')) {
  stdout.write(chunk.token);
}

// Blocking
final response = await edgeVeda.generate('Hello from on-device AI');
print(response.text);
```

Multi-Turn Chat

```dart
final session = ChatSession(
  edgeVeda: edgeVeda,
  preset: SystemPromptPreset.coder,
);

await for (final chunk in session.sendStream('Write hello world in Python')) {
  stdout.write(chunk.token);
}

// Model remembers the conversation
await for (final chunk in session.sendStream('Now convert it to Rust')) {
  stdout.write(chunk.token);
}

print('Turns: ${session.turnCount}');
print('Context: ${(session.contextUsage * 100).toInt()}%');
```

Tool Calling

```dart
final tools = ToolRegistry([
  ToolDefinition(
    name: 'get_time',
    description: 'Get the current time',
    parameters: {
      'type': 'object',
      'properties': {
        'timezone': {'type': 'string', 'enum': ['UTC', 'EST', 'PST']},
      },
      'required': ['timezone'],
    },
  ),
]);

final session = ChatSession(
  edgeVeda: edgeVeda,
  tools: tools,
  templateFormat: ChatTemplateFormat.qwen3,
);

final response = await session.sendWithTools(
  'What time is it in UTC?',
  onToolCall: (call) async {
    if (call.name == 'get_time') {
      return ToolResult.success(
        toolCallId: call.id,
        data: {'time': DateTime.now().toIso8601String()},
      );
    }
    return ToolResult.failure(toolCallId: call.id, error: 'Unknown tool');
  },
);
```

Streaming Speech-to-Text

```dart
final session = WhisperSession(modelPath: whisperModelPath);
await session.start();

// Listen for transcription segments
session.onSegment.listen((segment) {
  print('[${segment.startMs}ms] ${segment.text}');
});

// Feed audio from microphone
final audioSub = WhisperSession.microphone().listen((samples) {
  session.feedAudio(samples);
});

// Stop and get the full transcript
await audioSub.cancel();
await session.flush();
await session.stop();
print(session.transcript);
```

Embeddings & RAG

```dart
// Generate embeddings
final result = await edgeVeda.embed('On-device AI is the future');
print('Dimensions: ${result.embedding.length}');

// Build a vector index
final index = VectorIndex(dimensions: result.embedding.length);
index.add('doc1', result.embedding, metadata: {'source': 'readme'});
await index.save('/path/to/index.json');

// RAG pipeline
final rag = RagPipeline(
  edgeVeda: edgeVeda,
  index: index,
  config: RagConfig(topK: 3),
);
final answer = await rag.query('What is Edge-Veda?');
print(answer.text);
```

Continuous Vision Inference

```dart
final visionWorker = VisionWorker();
await visionWorker.spawn();
await visionWorker.initVision(
  modelPath: vlmModelPath,
  mmprojPath: mmprojPath,
  numThreads: 4,
  contextSize: 2048,
  useGpu: true,
);

// Process camera frames — model stays loaded across all calls
final result = await visionWorker.describeFrame(
  rgbBytes, width, height,
  prompt: 'Describe what you see.',
  maxTokens: 100,
);
print(result.description);
```

Adaptive Runtime Policy

Edge-Veda continuously monitors:

  • Device thermal state (nominal / fair / serious / critical)
  • Available memory (os_proc_available_memory)
  • Battery level and Low Power Mode

Based on these signals, it dynamically adjusts:

| QoS Level | FPS | Resolution | Tokens | Trigger |
|---|---|---|---|---|
| Full | 2 | 640px | 100 | No pressure |
| Reduced | 1 | 480px | 75 | Thermal warning, battery <15%, memory <200MB |
| Minimal | 1 | 320px | 50 | Thermal serious, battery <5%, memory <100MB |
| Paused | 0 | n/a | 0 | Thermal critical, memory <50MB |

Escalation is immediate. Thermal spikes are dangerous and must be responded to without delay.

Restoration requires cooldown (60s per level) and happens one level at a time. Full recovery from paused to full takes 3 minutes. This prevents oscillation where the system rapidly alternates between high and low quality.
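
The asymmetry is the point: pressure is acted on the moment it appears, while recovery is deliberately slow. A minimal sketch of the escalate-fast, restore-slow logic (illustrative; not the actual RuntimePolicy API):

```dart
// Illustrative sketch — not the real RuntimePolicy API.
// Severity increases with the enum index.
enum Qos { full, reduced, minimal, paused }

class HysteresisPolicy {
  Qos level = Qos.full;
  DateTime _lastChange = DateTime.now();
  static const cooldown = Duration(seconds: 60);

  /// Pressure is applied immediately, jumping as far as needed.
  void onPressure(Qos required) {
    if (required.index > level.index) {
      level = required;
      _lastChange = DateTime.now();
    }
  }

  /// Recovery is gradual: at most one level per cooldown window.
  void onTick(Qos target) {
    if (target.index < level.index &&
        DateTime.now().difference(_lastChange) >= cooldown) {
      level = Qos.values[level.index - 1];
      _lastChange = DateTime.now();
    }
  }
}
```

One level per 60-second window is why recovery from paused to full takes three windows, the 3-minute figure above.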


Compute Budgets

Declare runtime guarantees. The Scheduler enforces them.

```dart
// Option 1: Adaptive — auto-calibrates to this device's actual performance
final scheduler = Scheduler(telemetry: TelemetryService());
scheduler.setBudget(EdgeVedaBudget.adaptive(BudgetProfile.balanced));

// Option 2: Static — explicit values
scheduler.setBudget(const EdgeVedaBudget(
  p95LatencyMs: 3000,
  batteryDrainPerTenMinutes: 5.0,
  maxThermalLevel: 2,
));

// Register workloads with priorities
scheduler.registerWorkload(WorkloadId.vision, priority: WorkloadPriority.high);
scheduler.registerWorkload(WorkloadId.text, priority: WorkloadPriority.low);
scheduler.registerWorkload(WorkloadId.stt, priority: WorkloadPriority.low);
scheduler.start();

// React to violations
scheduler.onBudgetViolation.listen((v) {
  print('${v.constraint}: ${v.currentValue} > ${v.budgetValue}');
});
```

Adaptive profiles resolve against measured device performance after warm-up:

| Profile | p95 Multiplier | Battery | Thermal | Use Case |
|---|---|---|---|---|
| Conservative | 2.0x | 0.6x (strict) | Floor 1 | Background workloads |
| Balanced | 1.5x | 1.0x (match) | Floor 2 | Default for most apps |
| Performance | 1.1x | 1.5x (generous) | Allow 3 | Latency-sensitive apps |

For example, if warm-up measures a p95 latency of 2,000 ms, the Balanced profile resolves the latency ceiling to 3,000 ms (1.5x the measured value).


Benchmarks

All numbers measured on a physical iPhone (A16 Bionic, 6 GB RAM, iOS 26.2.1) with Metal GPU. See BENCHMARKS.md for full details.

Text Generation

| Metric | Value |
|---|---|
| Throughput | 42–43 tok/s |
| Steady-state memory | 400–550 MB |
| Multi-turn stability | No degradation over 10+ turns |

RAG (Retrieval-Augmented Generation)

| Metric | Value |
|---|---|
| Generation speed | 42–43 tok/s |
| Vector search | <1 ms |
| End-to-end retrieval | 305–865 ms |

Continuous Vision (Soak Test)

| Metric | Value |
|---|---|
| Sustained runtime | 12.6 minutes |
| Frames processed | 254 |
| p50 / p95 / p99 latency | 1,412 / 2,283 / 2,597 ms |
| Crashes / model reloads | 0 / 0 |

Speech-to-Text

| Metric | Value |
|---|---|
| Transcription latency (p50) | ~670 ms per 3 s chunk |
| Model size | 77 MB |
| Streaming | Real-time segments |

Memory Optimization

| Metric | Before | After |
|---|---|---|
| KV cache | ~64 MB | ~32 MB (Q8_0) |
| Steady-state memory | ~1,200 MB peak | 400–550 MB |

Observability

The built-in performance flight recorder writes per-frame JSONL traces:

  • Per-stage timing (image encode / prompt eval / decode)
  • Runtime policy transitions (QoS level changes)
  • Frame drop statistics
  • Memory and thermal telemetry

Traces are analyzed offline using tools/analyze_trace.py (p50/p95/p99 stats, throughput charts, thermal overlays).


Supported Models

Pre-configured in ModelRegistry with download URLs and SHA-256 checksums:

| Model | Size | Type | Use Case |
|---|---|---|---|
| Llama 3.2 1B Instruct | 668 MB | Text (Q4_K_M) | General chat, instruction following |
| Qwen3 0.6B | 397 MB | Text (Q4_K_M) | Tool/function calling |
| Phi 3.5 Mini Instruct | 2.3 GB | Text (Q4_K_M) | Reasoning, longer context |
| Gemma 2 2B Instruct | 1.6 GB | Text (Q4_K_M) | General purpose |
| TinyLlama 1.1B Chat | 669 MB | Text (Q4_K_M) | Lightweight, fast inference |
| SmolVLM2 500M | 417 MB + 190 MB | Vision (Q8_0 + F16) | Image description |
| All MiniLM L6 v2 | 46 MB | Embedding (F16) | Document embeddings for RAG |
| Whisper Tiny English | 77 MB | Speech (F16) | Speech-to-text transcription |
| Whisper Base English | 142 MB | Speech (F16) | Higher-accuracy transcription |

Any GGUF model compatible with llama.cpp can be loaded by file path.


Platform Support

| Platform | GPU | Status |
|---|---|---|
| iOS (device) | Metal | Fully validated on-device |
| iOS (simulator) | CPU | Working (Metal stubs, no mic) |
| Android | CPU | Scaffolded, validation pending |
| Android | Vulkan | Planned |


Repository Layout

```
edge-veda/
+-- core/
|   +-- include/edge_veda.h       C API (50 functions, 858 LOC)
|   +-- src/engine.cpp            Text inference + embeddings (1,173 LOC)
|   +-- src/vision_engine.cpp     Vision inference (484 LOC)
|   +-- src/whisper_engine.cpp    Speech-to-text (290 LOC)
|   +-- src/memory_guard.cpp      Memory monitoring (625 LOC)
|   +-- third_party/llama.cpp/    llama.cpp b7952 (git submodule)
|   +-- third_party/whisper.cpp/  whisper.cpp v1.8.3 (git submodule)
+-- flutter/
|   +-- lib/                      Dart SDK (32 files, 11,750 LOC)
|   +-- ios/                      Podspec + XCFramework
|   +-- android/                  Android plugin (scaffolded)
|   +-- example/                  Demo app (10 files, 8,383 LOC)
|   +-- test/                     Unit tests (253 LOC, 14 tests)
+-- scripts/
|   +-- build-ios.sh              XCFramework build pipeline (406 LOC)
+-- tools/
|   +-- analyze_trace.py          Soak test JSONL analysis (1,797 LOC)
```

Building from Source

Requirements:

  • macOS with Xcode 15+ (tested with Xcode 26.1)
  • Flutter 3.16+ (tested with 3.38.9)
  • CMake 3.21+

```sh
./scripts/build-ios.sh --clean --release
```

Compiles llama.cpp + whisper.cpp + Edge Veda C code for device (arm64) and simulator (arm64), merges static libraries into a single XCFramework.

```sh
cd flutter/example
flutter run
```

The demo app includes Chat (multi-turn with tool calling), Vision (continuous camera scanning), STT (live microphone transcription), and Settings (model management, device info).


Roadmap

  • Android sustained runtime validation (CPU + Vulkan GPU)
  • Text-to-speech integration
  • Semantic perception APIs (event-driven vision)
  • Observability dashboard (localhost trace viewer)
  • NPU/CoreML backend support
  • Model conversion toolchain

Who It's For

Edge-Veda is designed for teams building:

  • On-device AI assistants
  • Continuous perception apps
  • Privacy-sensitive AI systems
  • Long-running edge agents
  • Voice-first applications
  • Regulated or offline-first applications

Contributing

Contributions are welcome. High-impact areas:

  • Platform validation — Android CPU/Vulkan testing on real devices
  • Runtime policy — New QoS strategies, thermal adaptation improvements
  • Trace analysis — Visualization tools, anomaly detection, regression tracking
  • Model support — Testing additional GGUF models, quantization profiles
  • Example apps — Minimal examples for specific use cases (document scanner, voice assistant, visual QA)

To get started:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-feature)
  3. Make changes and verify with dart analyze (SDK) and flutter analyze (demo app)
  4. Run tests: cd flutter && flutter test
  5. Commit with descriptive messages
  6. Open a Pull Request with a summary of what changed and why

Guidelines:

  • Dart: follow standard dart format conventions
  • C++: match existing style in core/src/
  • All FFI calls must run in isolates (never on main thread)
  • New C API functions must be added to the podspec symbol whitelist

License

Apache 2.0


Built on llama.cpp and whisper.cpp by Georgi Gerganov and contributors.
