Production agents, LLM pipelines, and the backend systems that carry them at scale. I write Python, switch to Julia when the GIL gets in the way, and patch the ML libraries I depend on. Most of my work is the layer where AI systems quietly succeed or fail: retrieval quality, evaluation, failover, deployment.
I build production AI systems: agents, LLM pipelines, and automated workflows that replace manual processes in real business operations. My background spans healthcare, supply chain, enterprise automation, and open-source ML infrastructure.
Right now I'm an AI Product Engineer at Kodamai, productizing Kelvingrove — an enterprise agent platform that validates LLM outputs against typed interfaces before execution. I joined as a Forward Deployed AI Engineer working directly with clients, and moved to the product team in May 2026. Before that I was the sole data scientist on a physician-facing clinical AI platform at Exora AI, validated with in-house physicians and pilot clinics before live patient use.
I also own the systems that carry these models in production: APIs, data pipelines, queues, caches, observability. Most of my work is as much about the systems around the model as the model itself. And I read other people's source code the way some people read novels — if something is broken I have a hard time leaving it alone, which is how most of my open-source patches happened.
Product transition team, turning Kelvingrove (typed-interface agent platform) into customer-facing products.
Enterprise AI delivery for supply chain and manufacturing clients.
Sole data scientist on LoqumAI, a clinical AI platform for triage, consultation, and SOAP documentation. Owned the AI/ML layer end-to-end plus backend services and deployment.
Built the data layer of an e-commerce price-intelligence platform.
Top Rated, 100% Job Success. 20+ delivered projects for clients across the US, UK, and Europe.
All in my own time, alongside full-time work. Every line below links to the real pull request.
↗ full activity: github.com/mohsinm-dev
A from-scratch PyTorch implementation of Kimi Delta Attention — a gated linear recurrence with fast-weight memory for efficient long-context modeling. Multi-head layer with trainable decay/learning-rate gates, separate training (scan) and generation (recurrent) modes, and a test suite verifying mathematical correctness against a naive reference.
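For readers unfamiliar with the recurrence, here is a rough pure-Python sketch of a single generation-mode step. It is not code from the repository. It assumes the gated delta-rule form usually written as S_t = (I - beta_t k_t k_t^T) Diag(alpha_t) S_{t-1} + beta_t k_t v_t^T with output o_t = S_t^T q_t, and the function and argument names are illustrative.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def gated_delta_step(state: Matrix, q: Vector, k: Vector, v: Vector,
                     alpha: Vector, beta: float) -> Vector:
    """One recurrent step of an (assumed) gated delta rule.

    `state` is the d_k x d_v fast-weight memory and is updated in place.
    `alpha` is a per-key-channel decay in [0, 1]; `beta` is the write strength.
    """
    d_k, d_v = len(k), len(v)

    # 1) Per-channel decay: S <- Diag(alpha) S
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] *= alpha[i]

    # 2) What the decayed memory currently returns for this key: S^T k
    predicted = [sum(k[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]

    # 3) Delta-rule write: S <- S + beta * k (v - predicted)^T
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] += beta * k[i] * (v[j] - predicted[j])

    # 4) Read with the query: o = S^T q
    return [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
```

With a unit-norm key, no decay (alpha = 1), and beta = 1, a second write to the same key replaces the first value rather than adding to it, which is the property that separates the delta rule from plain additive linear attention. A loop over steps like this is also the kind of naive reference a scan-mode implementation can be tested against.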
github.com/mohsinm-dev/kda-attention →
An open-source Model Context Protocol server connecting Claude and other LLMs to live horse-racing data: 35+ tools spanning entity resolution, racecards, results, and statistical and breeding analysis, with TTL caching, rate limiting, and both stdio and HTTP/SSE transports.
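The TTL caching and rate limiting mentioned above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the server's code: the class names, the token-bucket choice, and the injectable `clock` parameter are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def set(self, key: Hashable, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # drop the stale entry on read
            return None
        return value


class TokenBucket:
    """Token-bucket limiter: refills at `rate_per_second`, bursts up to `capacity`."""

    def __init__(self, rate_per_second: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate_per_second
        self._capacity = float(capacity)
        self._tokens = float(capacity)
        self._clock = clock
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self._tokens = min(self._capacity,
                           self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

In a tool handler, the cache would typically be consulted first, and the limiter only charged when the request actually has to go upstream.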
github.com/mohsinm-dev/racing-mcp-server →
Most agents are overconfident about their own success rates. In the Exora clinical work this was the real failure mode: on ambiguous input the system would act autonomously when it should have escalated. Benchmarks did not catch it; only a physician-in-the-loop eval did. I'm interested in explicit confidence gating, escalation policies a reviewer can actually read, and evals built from real failure modes rather than academic test sets. The harder the domain, the more this matters.
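To make "an escalation policy a reviewer can actually read" concrete, here is a minimal sketch. It is hypothetical throughout: the thresholds, the `ambiguity` score (assumed to come from some separate detector), and every name below are illustrative and do not describe the clinical system.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACT = "act"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class EscalationPolicy:
    # Hypothetical thresholds; in practice they would be tuned on reviewed cases.
    min_confidence: float = 0.90
    max_ambiguity: float = 0.20


@dataclass(frozen=True)
class Decision:
    action: Action
    reason: str  # plain-language reason a reviewer can audit


def gate(confidence: float, ambiguity: float,
         policy: EscalationPolicy = EscalationPolicy()) -> Decision:
    """Decide whether the agent may act on its own or must hand off to a human."""
    # Ambiguity is checked first: a confident answer to an unclear question
    # is exactly the case that should not be trusted.
    if ambiguity > policy.max_ambiguity:
        return Decision(Action.ESCALATE,
                        f"input ambiguity {ambiguity:.2f} exceeds {policy.max_ambiguity:.2f}")
    if confidence < policy.min_confidence:
        return Decision(Action.ESCALATE,
                        f"confidence {confidence:.2f} below {policy.min_confidence:.2f}")
    return Decision(Action.ACT, "confident and unambiguous")
```

Keeping the policy as a small frozen dataclass means the whole rule fits on one screen, and every decision carries the reason it was made.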
The gap between a model that works in a demo and one that's affordable under real load is mostly engineering: caching, routing between large and small models, and knowing which requests actually need the expensive path.
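One way to picture the caching and routing described above is a small wrapper that tries a cache, then decides which model tier a request needs. This is a sketch under stated assumptions: `small`, `large`, and `needs_large` are hypothetical callables supplied by the caller, and exact-match caching on the prompt string stands in for whatever keying a real system would use.

```python
from typing import Callable, Dict, Tuple

Model = Callable[[str], str]


def make_router(small: Model, large: Model,
                needs_large: Callable[[str], bool]) -> Callable[[str], Tuple[str, str]]:
    """Return a function that answers a prompt and reports which path served it."""
    cache: Dict[str, str] = {}

    def answer(prompt: str) -> Tuple[str, str]:
        # Cheapest path first: a repeated prompt costs nothing.
        if prompt in cache:
            return cache[prompt], "cache"
        # Only requests flagged by the (hypothetical) classifier pay for the large model.
        tier = "large" if needs_large(prompt) else "small"
        result = large(prompt) if tier == "large" else small(prompt)
        cache[prompt] = result
        return result, tier

    return answer
```

Returning the tier alongside the answer makes it easy to log what fraction of traffic takes the expensive path, which is usually the first number worth watching.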
Active reading thread, not yet production work — how systems keep learning from their own deployment without forgetting what already works.
I take on a small number of outside projects at a time. Usually reply same-day.