I build the platforms that ML and AI products run on — from decomposing a monolithic ML inference system into cloud-native microservices, to owning the GCP/Kubernetes substrate of an enterprise LLM platform — and on the side I ship open-source LLM-inference tooling and a co-founded marketplace. One through-line: production infrastructure done with discipline.
The same engineer and the same discipline across three threads — two production platforms built at work, and the open-source LLM tooling and marketplace I build on the side.
As Senior Software Engineer at Bodygram, decomposed a 4,000+ line monolithic ML pipeline into async microservices bound by a shared contract library and migrated the platform from AWS to GCP — cutting infrastructure costs ~70% and lifting model inference ~8×.
Owned the cloud substrate of a multi-tenant LLM platform — reusable Terraform modules, a serverless→Kubernetes migration platform, autoscaling, observability, an alert-routing service, and an AI agent for alert triage.
I build and ship LLM-inference infrastructure (a Rust inference engine, a distributed mesh product, and a model-serving library) alongside a co-founded marketplace.
Patterns that recur across both platforms and my own projects — derived from production code, not hypotheticals.
Every boundary is a typed, validated schema — inputs, outputs, service responses, error payloads. The schema is the API contract, and the source of truth lives in one shared place.
Long descriptive names, full type hints, no magic globals. Configuration centralized per service and loaded from the environment with typed defaults.
Optional dependencies, mock tracers when telemetry isn't configured, circuit breakers and failover. Services that fail return structured errors, not stack traces.
Health checks, readiness probes, distributed tracing, structured logging, request-ID propagation, and multi-environment configuration — built into every service from the first commit.
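Two of the patterns above, typed configuration loaded from the environment and a stand-in tracer when telemetry is not configured, can be sketched with the standard library alone. This is an illustrative sketch, not the production code: `ServiceConfig`, `NoOpTracer`, `RecordingTracer`, `build_tracer`, and every environment variable name are hypothetical.

```python
import os
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical per-service settings, read once from the environment."""
    service_name: str = "example-service"
    request_timeout_seconds: float = 30.0
    max_concurrent_requests: int = 8
    telemetry_endpoint: Optional[str] = None  # unset means tracing is off

    @classmethod
    def from_environment(cls, environ: Mapping[str, str] = os.environ) -> "ServiceConfig":
        # Each value is converted to its declared type; a bad value raises
        # ValueError at startup instead of failing later inside a request.
        return cls(
            service_name=environ.get("SERVICE_NAME", cls.service_name),
            request_timeout_seconds=float(
                environ.get("REQUEST_TIMEOUT_SECONDS", cls.request_timeout_seconds)),
            max_concurrent_requests=int(
                environ.get("MAX_CONCURRENT_REQUESTS", cls.max_concurrent_requests)),
            telemetry_endpoint=environ.get("TELEMETRY_ENDPOINT") or None,
        )


class NoOpTracer:
    """Stand-in used when no telemetry endpoint is configured."""

    @contextmanager
    def start_span(self, name: str):
        yield None


class RecordingTracer:
    """Toy tracer that records finished span names; a real one would export them."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint
        self.finished_spans = []

    @contextmanager
    def start_span(self, name: str):
        try:
            yield name
        finally:
            self.finished_spans.append(name)


def build_tracer(config: ServiceConfig):
    # Callers always get an object with start_span(), so service code
    # never has to branch on whether telemetry is available.
    if config.telemetry_endpoint is None:
        return NoOpTracer()
    return RecordingTracer(config.telemetry_endpoint)
```

Because both tracers expose the same `start_span` context manager, handler code can wrap work in `with tracer.start_span("predict"):` regardless of the environment it runs in.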
As Senior Software Engineer at Bodygram, I built a production ML inference platform — a shared Python library, 7+ async microservices, infrastructure-as-code for two clouds, CI/CD, and Helm charts — decomposed from a monolith and migrated AWS→GCP as the sole infrastructure engineer. It cut infrastructure costs ~70% and lifted model inference ~8× (async → real-time). The condensed highlights are below.
Read the full ML Platform portfolio →

The core achievement: decomposing a 4,000+ line monolithic ML pipeline into independently deployable, async microservices orchestrated by a central broker, without taking production down.
| Aspect | Legacy | Refactored |
|---|---|---|
| Architecture | Sequential function chain | Async microservices |
| I/O | Synchronous, blocking | async/await, concurrent |
| Scaling | Vertical only | Horizontal (HPA per service) |
| Errors | Boolean returns + logs | Typed exceptions + status codes |
| Deploy | Single container | 7+ independent containers |
A purpose-built shared library enforces consistency across every service — models, exceptions, logging, observability. The orchestrator then coordinates them with async HTTP and dependency-aware parallelism.
Base models define core fields; subclasses progressively enrich the schema per pipeline stage — from a minimal set to full ground-truth.
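Python (synthetic example):

```python
from pydantic import BaseModel, PositiveFloat, PositiveInt


class CoreOutput(BaseModel):
    # Minimal schema shared by every pipeline stage.
    field_alpha: PositiveInt | PositiveFloat
    field_beta: PositiveInt | PositiveFloat
    # ~7 essential fields


class PlatformOutput(CoreOutput):
    # Later stages inherit the core fields and add their own.
    field_epsilon: PositiveInt | PositiveFloat
    # ~28 fields total
```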
One function handles all downstream calls — FormData construction, response validation against a model, status-code checks, and domain-specific exceptions.
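The shape of such a helper can be sketched as follows. This is a standard-library stand-in rather than the real function: the transport is injected so the example runs without a network, a dataclass replaces the Pydantic model, and `call_downstream`, `Transport`, `QualityCheckResponse`, and both exception names are hypothetical.

```python
import asyncio
from dataclasses import dataclass, fields
from typing import Any, Awaitable, Callable, Mapping, Tuple, Type, TypeVar


class DownstreamStatusError(Exception):
    """The downstream service answered with a non-2xx status code."""


class DownstreamSchemaError(Exception):
    """The downstream response did not match the expected model."""


@dataclass
class QualityCheckResponse:
    """Hypothetical response model for one downstream service."""
    score: float
    passed: bool


ResponseModel = TypeVar("ResponseModel")
# Hypothetical transport: (url, form_fields) -> (status_code, json_body).
Transport = Callable[[str, Mapping[str, Any]], Awaitable[Tuple[int, dict]]]


async def call_downstream(
    transport: Transport,
    url: str,
    form_fields: Mapping[str, Any],
    response_model: Type[ResponseModel],
) -> ResponseModel:
    status_code, body = await transport(url, form_fields)

    # Status-code check first: an error body should never be parsed as data.
    if not 200 <= status_code < 300:
        raise DownstreamStatusError(f"{url} returned HTTP {status_code}")

    # Presence check only; a Pydantic model would also coerce and type-check.
    expected = {model_field.name for model_field in fields(response_model)}
    missing = expected - set(body)
    if missing:
        raise DownstreamSchemaError(f"{url} response is missing {sorted(missing)}")

    # Unknown keys in the body are dropped rather than passed through.
    return response_model(**{name: body[name] for name in expected})
```

Routing every call through one function means each caller sees either a validated model instance or a domain-specific exception, never a raw response.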
Independent stages run in parallel; dependent stages await their inputs. Wall-clock time is minimized through careful dependency analysis.
Python (synthetic example):

```python
import asyncio

import aiohttp


async def run_pipeline(prepared, auth):  # wrapper name is illustrative
    # call_service_* are the per-service helpers, defined elsewhere.
    async with aiohttp.ClientSession(auth=auth) as session:
        # Layer 1: parallel (no data deps)
        svc_a, svc_b = await asyncio.gather(
            call_service_a(session, prepared),
            call_service_b(session, prepared))
        if svc_b.passed:
            # Layer 2: sequential (needs A output)
            svc_d = await call_service_d(session, svc_a.features)
            svc_e = await call_service_e(...)
```
Every service follows the same three-layer shape (API / service / config) with standardized /healthz & /readyz probes, distributed tracing, and request-ID middleware.
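Request-ID propagation, one piece of that shared shape, can be sketched without a web framework. This is a minimal illustration under assumptions: the `X-Request-ID` header name and the helpers `bind_request_id`, `outgoing_headers`, and `RequestIdFilter` are hypothetical, and a real middleware would call them around each request.

```python
import contextvars
import logging
import uuid
from typing import Dict, Mapping

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name

# A ContextVar keeps the ID isolated per asyncio task, so concurrent
# requests handled by the same process never see each other's ID.
current_request_id = contextvars.ContextVar("current_request_id", default="-")


def bind_request_id(incoming_headers: Mapping[str, str]) -> str:
    """Reuse the caller's request ID, or mint one at the edge."""
    request_id = incoming_headers.get(REQUEST_ID_HEADER) or uuid.uuid4().hex
    current_request_id.set(request_id)
    return request_id


def outgoing_headers() -> Dict[str, str]:
    """Headers to attach to downstream calls made while handling this request."""
    return {REQUEST_ID_HEADER: current_request_id.get()}


class RequestIdFilter(logging.Filter):
    """Stamps every log record with the current request ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = current_request_id.get()
        return True
```

With the filter attached to a handler, a log format such as `"%(request_id)s %(message)s"` lets one ID be followed across every service a request touches.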
Migrated the platform from AWS CDK (TypeScript) to OpenTofu on GCP — designing GKE clusters with GPU time-sharing, multi-environment isolation, and cost-optimized spot instances — as the sole infrastructure engineer.
HCL repository layout (synthetic example):

```
infra/
├─ environments/
│  ├─ dev/                    # spot GPUs, scale-to-zero
│  └─ prod/                   # reserved instances
├─ modules/gcp/
│  ├─ kubernetes/cluster/
│  ├─ kubernetes/node_pool/
│  └─ iam/ alerts/ storage/
└─ modules/aws/               # OIDC federation
```
Platform / infrastructure & DevOps owner at Japan AI (a generative-AI company in the publicly-listed GENIEE group, building products such as JAPAN AI CHAT, AGENT, and SPEECH). ~700 commits over 8 months across the cloud substrate of a multi-tenant LLM platform. Condensed highlights below; details are generalized.
Read the full LLM Platform portfolio →

Built the GCP infrastructure-as-code foundation as a library of reusable, validated Terraform modules, refactoring copy-pasted, per-service configuration into shared modules for IAM, service accounts, Artifact Registry, Cloud Build triggers, and alert policies.
One module definition reused across many services — with input validation and bounded provider versions, so misconfiguration fails fast at plan time instead of in production.
HCL (synthetic example):

```hcl
module "service_account" {
  source            = "../modules/iam/service_account"
  account_id        = var.name
  roles             = var.roles
  workload_identity = true # GKE KSA binding
}

variable "name" {
  type = string
  validation {
    condition     = length(var.name) <= 30
    error_message = "SA id must be 30 chars or fewer."
  }
}
```
A production Helm chart — semver, changelog, named maintainer — that standardizes moving serverless (Cloud Run) services onto GKE, gated by a full unit-test suite.
Deep-merge of common and per-environment values, and a single-syntax volume field that auto-selects the CSI driver — plain name → PVC, gcs:// → GCS Fuse, nfs:// → NFS — with Workload Identity and an optional Cloud SQL Proxy sidecar.
helm unittest across deployment, resources, RBAC, secrets, managed DB, and every volume type — run by a GitHub Actions pipeline (lint / unit / render) and a pre-push git hook.
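The two chart behaviors described above, deep-merging common and per-environment values and dispatching on the volume shorthand, are implemented in Helm templates; the Python below only illustrates the rules. `deep_merge`, `resolve_volume`, and the returned dictionary keys are hypothetical and do not mirror the chart's actual output.

```python
from typing import Any, Dict, Mapping


def deep_merge(common: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Merge nested mappings key by key; scalars and lists in override win."""
    merged = dict(common)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_volume(source: str) -> Dict[str, str]:
    """Pick a volume backend from the shorthand's prefix."""
    if source.startswith("gcs://"):
        return {"kind": "gcsfuse", "bucket": source[len("gcs://"):]}
    if source.startswith("nfs://"):
        # nfs://<server>/<export path>
        server, _, path = source[len("nfs://"):].partition("/")
        return {"kind": "nfs", "server": server, "path": "/" + path}
    # No scheme: treat the value as the name of an existing PVC.
    return {"kind": "pvc", "claimName": source}
```

For example, merging `{"resources": {"cpu": "1", "memory": "1Gi"}}` with a production override of `{"resources": {"cpu": "2"}}` keeps the memory setting and replaces only the CPU value.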
Queue-backed workers scale on queue depth, not CPU. The chart auto-creates the KEDA TriggerAuthentication from the deployment's own secret — eliminating a class of silent scaling failures.
YAML (synthetic example):

```yaml
kind: ScaledObject
spec:
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata: { queueName: tasks, queueLength: "30" }
```
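Scaling on queue depth comes down to simple arithmetic: divide the backlog by the per-replica target, round up, and clamp to the configured bounds. The sketch below shows that calculation only; `desired_replicas` is a hypothetical helper, and the real autoscaler also applies tolerances and stabilization windows that are not modeled here.

```python
import math


def desired_replicas(
    queue_depth: int,
    target_per_replica: int,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Replica count needed so each worker holds at most the target backlog."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 30 messages per replica and bounds of 1 to 10, a backlog of 95 messages asks for 4 workers, while an empty queue stays at the floor of 1.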
Reliability defaults: min 2 replicas + PodDisruptionBudgets in prod, Gateway API HTTPRoute, secrets via CSI/envFrom.
A monitoring baseline (Prometheus / Grafana / Loki / Mimir), an alert-routing service, and an AI agent that triages alerts before a human is paged.
Autonomously scans priority services and receives repeat-alert webhooks, then runs a bounded multi-round LLM investigation over read-only log probes, ending in a structured verdict and escalation decision.
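The control flow of a bounded investigation can be sketched as a loop with a hard round budget. Everything here is hypothetical and simplified: `investigate`, `Verdict`, the step dictionary format, and the injected `ask_model` and `read_logs` callables stand in for the real agent, model client, and log probes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Verdict:
    """Structured outcome handed to the escalation logic."""
    summary: str
    escalate: bool
    rounds_used: int
    evidence: List[str] = field(default_factory=list)


def investigate(
    alert: str,
    ask_model: Callable[[str, List[str]], Dict],
    read_logs: Callable[[str], str],
    max_rounds: int = 3,
) -> Verdict:
    """Let the model request read-only log probes until it decides or runs out of rounds."""
    evidence: List[str] = []
    for round_number in range(1, max_rounds + 1):
        step = ask_model(alert, evidence)
        if step["action"] == "verdict":
            return Verdict(step["summary"], step["escalate"], round_number, evidence)
        # The only other action is a probe, and probes can only read.
        evidence.append(read_logs(step["query"]))
    # Budget exhausted without a verdict: page a human rather than guess.
    return Verdict("no conclusion within round budget", True, max_rounds, evidence)
```

The round budget bounds both cost and latency, and defaulting to escalation when the budget runs out keeps an indecisive model from silently swallowing an alert.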
On my own time I build LLM-inference infrastructure and ship it as real software — an open-source Rust inference engine, a distributed mesh product, a model-serving library, and a co-founded marketplace. The same spec-driven, agent-assisted process as the work above.
View public repositories →

Three of these orbit the same problem, running LLMs locally, scaling from a single-node engine to a whole-network mesh. The fourth shows the full-stack and product range behind the infrastructure work.
A single Rust binary that replaces Ollama — serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon through a hand-built Swift/C-ABI bridge. Four API surfaces (gRPC, HTTP/SSE, OpenAI-compatible, embeddable crate), a two-tier compressed + quantized KV cache, and continuous batching.
Pools every machine on a LAN into one on-premise cluster behind drop-in OpenAI/Ollama APIs — fair scheduling, peer-to-peer model transfer, and encryption by default, all from a single binary. Powered by spindll, so it serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon.
Turns any typed __call__ class into a FastAPI service by reflecting over its Pydantic types — convention-over-configuration model serving, with hand-rolled conventional-commit → semver → PyPI release automation and CI benchmarks. An independent project (not part of spindll).
A swipe-based second-hand marketplace, built full-stack solo: a FastAPI/Postgres backend, an Astro + React web app, a React Native mobile app, and a self-hosted CLIP moderation service on GCP. Product and full-stack range alongside the infrastructure work.
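The reflection idea behind the model-serving library above can be shown with the standard library alone. This sketch is not the library's code: it skips FastAPI and Pydantic entirely, and `describe_callable_class`, the route naming rule, and the `Sentiment` example class are hypothetical.

```python
import inspect
from typing import Any, Dict, get_type_hints


def describe_callable_class(model_class: type) -> Dict[str, Any]:
    """Derive a request/response description from a class's typed __call__."""
    call_method = model_class.__call__
    hints = get_type_hints(call_method)
    response_type = hints.pop("return", Any)
    signature = inspect.signature(call_method)
    request_fields = {
        name: hints.get(name, Any)
        for name in signature.parameters
        if name != "self"
    }
    return {
        "route": "/" + model_class.__name__.lower(),  # assumed naming convention
        "request_fields": request_fields,
        "response_type": response_type,
    }


class Sentiment:
    """Hypothetical model class whose __call__ signature is the whole API."""

    def __call__(self, text: str, top_k: int = 1) -> float:
        return 0.5
```

Calling `describe_callable_class(Sentiment)` yields a route plus typed request and response descriptions, which is the information a serving layer needs to generate an endpoint and its validation without extra configuration.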
Tokyo-based, 2016–present. Full history on LinkedIn.
Shinjuku, Tokyo · Building the GCP/GKE platform foundations, leading workload migration to Kubernetes, and standardizing Terraform, Helm, CI/CD, and production-infrastructure patterns.
Tokyo, Japan · Advisory and hands-on AI-pipeline migration, async/real-time serving design, and Linux troubleshooting for batch + REST AI systems.
Minato, Tokyo · Architected multi-cloud CI/CD (Cloud Build + OpenTofu) for 12+ AI services and ran the full model lifecycle on GKE with multi-environment Helm and CPU/GPU scheduling. Cut infrastructure costs ~70% (monolith → microservices) and lifted model inference ~8× (async → real-time).
Setagaya, Tokyo · Productionized ML/NLP models (CRF, BERT, PyTorch, TensorFlow) with CPU/GPU parallelism and GKE cluster management; system architecture and research-engineering support.
Osaka, Japan · Supported the Butler warehouse-robotics system at client sites across Japan.
Tokyo, Japan · Diagnosed and repaired Nao/Pepper robots — root-cause analysis, SOPs, and tooling in C/Python/Shell on Linux.
Shared libraries and generic modules as the single source of truth — consistency across many independently deployed services.
A monolith decomposed into microservices, and sprawling copy-pasted Terraform refactored into a reusable module library — both with the production pipeline running throughout.
IaC across AWS and GCP, GKE cluster design, GPU-aware and event-driven autoscaling, and cost optimization through spot/scheduled scaling.
Distributed tracing, structured logging, data-driven alerting, on-call automation, and an LLM triage agent that cuts manual toil.
From shared library to service to infrastructure to CI/CD to Helm charts — one coherent engineering mind across the whole stack.
Beyond infrastructure — open-source LLM tooling (a Rust engine, a mesh product, a serving library) and a co-founded marketplace: polyglot systems, native MLX, cross-platform binaries, and open-core thinking.