Med Veda

A private medical assistant that runs entirely on your phone — MedGemma 1.5 4B reading X-rays and answering questions over patient records, with zero data leaving the device.

Year: 2026
Role: On-device ML · Android

Kotlin · Jetpack Compose · llama.cpp · MedGemma · Room · CameraX · QLoRA · Android

Outcome

Built for the Kaggle Google MedGemma Challenge — a fully offline, multimodal clinical assistant on Android, with longitudinal patient records and English / Hindi / Telugu output, powered by a custom llama.cpp backend.

The problem

Clinical AI almost always means sending patient data to a server. In a hospital that is two problems at once: privacy, because protected health information (PHI) leaves the device, and latency, because you need a patient's history or a read on an X-ray now, not after a round trip to the cloud.

I wanted to find out how far a small, specialised model could go if it never left the phone.

What I built

Med Veda is an Android-first medical assistant that runs the MedGemma 1.5 4B multimodal model completely on-device:

Chat with patient records — ask natural-language questions over a patient's longitudinal history, stored locally.
Read X-rays — show a chest X-ray and ask for a structured read; the model responds across heart size, lung fields, bones, and mediastinum.
Stays private — 100% local execution, so PHI never leaves the device.
Speaks the patient's language — generates output in English, Hindi, and Telugu for accessibility.
Voice input — dictate symptoms instead of typing.
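
As a rough illustration of the first, second, and fourth features, here is a minimal sketch of how a locally stored history could be turned into a prompt with a target-language instruction. It is not the app's code: the app uses Room on Android, while this sketch uses Python's built-in sqlite3 for brevity, and the table layout, function names, and prompt wording are all hypothetical.

```python
import sqlite3

# Languages and X-ray regions named in the feature list above.
LANGUAGES = {"en": "English", "hi": "Hindi", "te": "Telugu"}
XRAY_REGIONS = ("heart size", "lung fields", "bones", "mediastinum")


def open_store(path=":memory:"):
    """Open a local SQLite store; nothing here touches the network."""
    db = sqlite3.connect(path)
    # Hypothetical schema: one row per visit.
    db.execute(
        "CREATE TABLE IF NOT EXISTS visits ("
        "patient_id TEXT NOT NULL, visit_date TEXT NOT NULL, note TEXT NOT NULL)"
    )
    return db


def add_visit(db, patient_id, visit_date, note):
    """Store one visit; visit_date is an ISO 8601 string, so text order is date order."""
    db.execute("INSERT INTO visits VALUES (?, ?, ?)", (patient_id, visit_date, note))
    db.commit()


def history(db, patient_id):
    """Return one patient's visits as 'date: note' lines, oldest first."""
    rows = db.execute(
        "SELECT visit_date, note FROM visits WHERE patient_id = ? ORDER BY visit_date",
        (patient_id,),
    )
    return [f"{visit_date}: {note}" for visit_date, note in rows]


def record_prompt(db, patient_id, question, lang="en"):
    """Assemble the longitudinal history, the question, and a language instruction."""
    lines = history(db, patient_id) or ["(no visits on record)"]
    return (
        "Patient history, oldest first:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\nAnswer in {LANGUAGES[lang]}."
    )


def xray_prompt(lang="en"):
    """Ask for a structured read with one heading per region."""
    sections = "\n".join(f"- {region.capitalize()}:" for region in XRAY_REGIONS)
    return (
        "Describe this chest X-ray under these headings:\n"
        f"{sections}\nAnswer in {LANGUAGES[lang]}."
    )
```

Calling `record_prompt(db, "p1", "Is the blood pressure trending up?", "hi")` would yield the patient's visits in date order followed by an instruction to answer in Hindi; the resulting string is what would be handed to the on-device model.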

Running a 4B multimodal model on a phone

The hard part was inference. I moved off MediaPipe onto a custom llama.cpp backend that runs Q4_K_M GGUF weights directly on Android, with multimodal support through the medically tuned SigLIP vision encoder. The ~2.8 GB model downloads on first launch via a background service; after that, everything runs offline. I tested it on a Qualcomm Innovator Development Kit (Snapdragon 8 Elite Gen 5) and a Pixel 7 Pro.
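
The first-launch download implies a check of roughly this shape before the app can go offline: is a complete, uncorrupted model file already on disk? The sketch below is an assumption about how such a check might look, not the app's actual service. The function names are hypothetical, and the expected size and SHA-256 digest are assumed to ship with the app.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-gigabyte model never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def model_ready(path, expected_size, expected_sha256):
    """Return True only if the file exists, has the expected size, and hashes correctly.

    The cheap checks run first, so a missing or truncated download is
    rejected without reading the whole file.
    """
    if not os.path.isfile(path):
        return False
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of(path) == expected_sha256
```

If `model_ready(...)` returns False, the background download would run (or resume); once it returns True, no further network access is needed for inference.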

Fine-tuning for the clinic

To get consistent, clinically shaped answers, I fine-tuned with QLoRA (rank 32 on the q_proj / v_proj attention matrices) to teach SOAP-style structure and vernacular translation, then converted the Hugging Face weights to Q4_K_M GGUF for the edge.
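
To give a sense of how small a rank-32 adapter on q_proj and v_proj is, the arithmetic can be sketched as follows. A LoRA adapter on a weight of shape d_out x d_in adds two matrices, A (rank x d_in) and B (d_out x rank), so it trains rank * (d_in + d_out) parameters. The projection shapes and layer count below are assumptions for illustration only; the real values should be read from the model's config.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def adapter_params(shapes, rank, num_layers):
    """Total trainable parameters when the same projections are adapted in every layer."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes.values())
    return num_layers * per_layer


# ASSUMED dimensions (d_in, d_out) and layer count, for illustration only.
ASSUMED_SHAPES = {"q_proj": (2560, 2048), "v_proj": (2560, 1024)}
ASSUMED_LAYERS = 34

total = adapter_params(ASSUMED_SHAPES, rank=32, num_layers=ASSUMED_LAYERS)
print(f"{total:,} trainable adapter parameters")  # prints 8,912,896 under these assumptions
```

Under those assumed shapes the adapter is on the order of nine million parameters, a fraction of a percent of a 4B-parameter base model, which is what makes this kind of fine-tuning cheap relative to full training.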

What I learned

On-device multimodal is genuinely viable now. The ceiling is memory and careful quantization, not raw capability.
Privacy can be an architectural property, not a policy. "No egress" is something you get for free once the model runs locally — and that changes what's possible in regulated settings.
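
The memory point comes down to simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by eight. The sketch below uses a nominal 4 billion parameters (from the "4B" in the model name) and an approximate 4.85 bits per weight for Q4_K_M; both figures are assumptions, and the estimate will not match the real file size exactly because K-quant formats mix bit widths across tensors.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for model weights, ignoring metadata and runtime buffers."""
    return num_params * bits_per_weight / 8


ASSUMED_PARAMS = 4_000_000_000  # nominal "4B"; an assumption, not a measured count
ASSUMED_BPW = {"f16": 16.0, "Q4_K_M (approx.)": 4.85}  # 4.85 is an approximate figure

for name, bpw in ASSUMED_BPW.items():
    gb = weight_bytes(ASSUMED_PARAMS, bpw) / 1e9
    print(f"{name}: about {gb:.1f} GB")
# prints:
# f16: about 8.0 GB
# Q4_K_M (approx.): about 2.4 GB
```

The gap between the two lines is the difference between a model that cannot fit alongside an app on a typical phone and one that can.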
