AI / ML engineer working across core ML, vision-language models, and production agentic systems.
I build production AI systems: agents, LLM pipelines, and automated workflows that replace manual processes in real business operations. My background spans healthcare, supply chain, enterprise automation, and open-source ML infrastructure.
Currently working as a Forward Deployed AI Engineer at Kodamai, where I act as the primary technical partner for enterprise clients across multiple simultaneous engagements. Previously built a physician-facing clinical AI platform at Exora AI that passed HIPAA and Singapore PDPA audits and was approved for live patient use.
My foundation is in classical and deep ML: XGBoost pipelines at scale, causal inference, computer vision (YOLOv11, segmentation, SAM 2), and fine-tuning foundation models with LoRA and modern quantization (AWQ, GPTQ, GGUF). I work daily with the current generation of vision-language models (Qwen3-VL, GLM-4.6V, InternVL3.5, MiniCPM-V) on document intelligence, medical imaging, ANPR, and long-context multimodal reasoning that now stretches to 256K interleaved tokens.
I contribute to open-source ML projects in my own time. I am ranked #5 on the official Google/Flax contributor leaderboard over the last twelve months, with patches merged into the Flax NNX core. I have six merged PRs in CPython and three in Uber/CausalML.
Building production agentic systems across multiple enterprise client engagements. Current projects include Madfo3 (accounts payable automation with LangGraph and SAP/ERP integration), Nazir (ANPR with active learning, YOLOv11 vehicle detection plus Qwen3-VL verification on edge devices), a 20TB document intelligence pipeline on GCP Vertex AI using Qwen3-VL and layout-aware embeddings served through vLLM, and a warehouse logistics automation system with SAP HANA integration.
Designed and shipped a multi-agent clinical AI assistant serving live physician workflows. Covered the full stack: LangGraph orchestration, RAG pipelines with hybrid vector search, Whisper-based STT, multimodal embeddings (SigLIP-2 + InternVL3) for clinical imagery, PHI-compliant observability across seven microservices, and evaluation frameworks tracking clinical competency across model versions. Reduced LLM latency by 35% and cost by 25%. Passed HIPAA and Singapore PDPA audits.
Sole engineer at a pre-seed health and wellness startup. Built and shipped the full GenAI product in six weeks: evaluated 8+ LLMs, selected and fine-tuned Mistral-7B with LoRA, and built self-hosted inference with quantized serving, safety guardrails, and an A/B evaluation framework. The demo contributed to closing a $250K pre-seed round.
Built a production price-intelligence pipeline: XGBoost models at ~12% MAPE, processing ~5,000 products per week at 90%+ accuracy, deployed on AWS EC2 with FastAPI, PostgreSQL, Redis, and Celery. Feature engineering across tabular and scraped data. ~99% uptime.
Delivered LLM fine-tuning (LoRA, QLoRA), NLP pipelines, computer vision systems (YOLOv5, CNN architectures, segmentation), and classical ML / forecasting for international clients across diverse domains.
Fixed a TextIOWrapper.tell() assertion failure triggered by a standalone carriage return (backported to 3.13 and 3.14). PR #144696 fixed the re.Match.group() documentation, which incorrectly claimed a [1..99] group-number limit. Also documented asyncio Task cancellation propagation and corrected inaccurate object-comparison docs.
How to build agents that know when to act and when to escalate. Most agents are overconfident about their own success rates. This matters more in regulated domains like healthcare and finance, where a wrong autonomous action has real consequences.
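A minimal sketch of the idea in plain Python. The threshold and calibration factor are illustrative assumptions, not values from any production system:

```python
from dataclasses import dataclass

@dataclass
class AgentDecision:
    action: str
    confidence: float  # the agent's self-reported confidence, 0..1

def should_escalate(decision: AgentDecision,
                    threshold: float = 0.9,
                    calibration_factor: float = 0.8) -> bool:
    """Escalate to a human unless calibrated confidence clears the bar.

    `calibration_factor` (hypothetical here) discounts raw self-reported
    confidence, since agents tend to be overconfident about their own
    success rates.
    """
    calibrated = decision.confidence * calibration_factor
    return calibrated < threshold
```

In a regulated domain the threshold would be set per action class, far stricter for anything that touches patient data or moves money.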
Offline benchmarks (MMMU, MathVista, MMStar, DocVQA) often do not predict production quality. I am interested in evaluation methods that catch what actually matters: user outcomes, business metrics, and real-world failure modes that benchmarks systematically miss.
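One way to make that gap concrete is to check rank correlation between offline benchmark scores and a production metric for the same set of models. A self-contained sketch with hypothetical numbers (assumes no tied scores):

```python
def ranks(xs):
    """Rank positions (0 = smallest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation via the classic d^2 formula."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical: four models, benchmark accuracy vs. production task-success rate.
benchmark = [60, 65, 70, 75]
production = [0.80, 0.60, 0.90, 0.70]
```

A correlation near zero, as in this toy example, is exactly the failure mode worth catching before committing to a model on benchmark numbers alone.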
Quantization (AWQ, GPTQ, GGUF, FP8), context compression, distillation, and SLM routing for deployment in environments with strict latency, cost, or data-residency requirements. Currently exploring how quantization affects reasoning quality in non-English languages, Urdu specifically, where the damage is often invisible to standard automatic metrics.
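The routing half of that is simple to sketch. A toy heuristic router; model names, cutoffs, and markers are all placeholders, and in practice the gate would be a trained classifier rather than keyword matching:

```python
def route(prompt: str,
          long_cutoff: int = 2000,
          reasoning_markers: tuple = ("prove", "derive", "step by step")) -> str:
    """Send cheap requests to a quantized small model, the rest to a large one.

    Toy heuristic: long prompts, or prompts containing explicit reasoning
    markers, go to the large model; everything else to the small one.
    """
    text = prompt.lower()
    if len(prompt) > long_cutoff or any(m in text for m in reasoning_markers):
        return "large-model-fp8"   # placeholder model name
    return "small-model-awq"       # placeholder model name
```

The payoff is that the bulk of traffic never touches the expensive model, which is what makes strict latency and cost budgets workable.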
How well current VLMs (Qwen3-VL, GLM-4.6V, InternVL3.5, Llama 4 multimodal, Gemini 3) actually generalize to messy real-world data: medical imagery, scanned documents, surveillance feeds, long-form video. Particular interest in long-context multimodal reasoning at 256K+ tokens and the gap between MMMU/MathVision scores and production reliability.
Building systems that work reliably for speakers of non-Latin script languages, where current models, tokenizers, and evaluation infrastructure are weakest. Particular interest in how training choices (tokenization, data mixture, instruction tuning) propagate into downstream failure modes for users of under-represented languages.
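A crude but measurable instance of the problem: byte-fallback tokenizers fragment non-Latin scripts more heavily, because those scripts cost more UTF-8 bytes per character. A stdlib-only proxy (real fertility numbers would come from the actual tokenizer, not this approximation):

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character.

    A rough proxy for how much harder a byte-fallback tokenizer works on a
    script: Latin text sits at 1.0, Arabic-script text (e.g. Urdu) near 2.0,
    many Indic scripts near 3.0.
    """
    return len(text.encode("utf-8")) / len(text)

english = "hello world"
urdu = "سلام دنیا"  # roughly "hello world" in Urdu script
```

Same nominal context window, but users of the heavier scripts fit two to three times less content into it, and pay proportionally more per request.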
Available for production AI engagements and research collaborations.
mohsinmahmood675@gmail.com