Platform · Infrastructure · ML Systems

Mansa Camara
Platform & ML Infrastructure Engineer

I build the platforms that ML and AI products run on — from decomposing a monolithic ML inference system into cloud-native microservices, to owning the GCP/Kubernetes substrate of an enterprise LLM platform — and on the side I ship open-source LLM-inference tooling and a co-founded marketplace. One through-line: production infrastructure done with discipline.

Python · Go · Rust · Swift / MLX · FastAPI · PyTorch · Pydantic · Kubernetes / GKE · Terraform · Helm · KEDA · GCP / AWS · Cloud Run / Knative · Prometheus / Grafana · OpenTelemetry · asyncio
Years Engineering: 7+
Production Platforms: 2
Products Shipped: 4
Cloud Platforms: GCP · AWS
00 — Overview

Two Platforms — and What I Ship on My Own

The same engineer and the same discipline across three threads — two production platforms built at work, and the open-source LLM tooling and marketplace I build on the side.

ML Inference Platform — Bodygram

As Senior Software Engineer at Bodygram, decomposed a 4,000+ line monolithic ML pipeline into async microservices bound by a shared contract library and migrated the platform from AWS to GCP — cutting infrastructure costs ~70% and lifting model inference ~8×.

2023 – 2025 · Bodygram · ML systems

Enterprise LLM Platform

Owned the cloud substrate of a multi-tenant LLM platform — reusable Terraform modules, a serverless→Kubernetes migration platform, autoscaling, observability, an alert-routing service, and an AI agent for alert triage.

2025 – 2026 · platform / infra · AI-augmented

Open Source & Products

I build and ship LLM-inference infrastructure: a Rust inference engine, a distributed mesh product, a model-serving library, and a co-founded marketplace.

2024 – 2026 · Rust · Swift · Python · open source + product
A note on method. The earlier platform was hand-authored with only editor tab-completion; the more recent work was produced with modern coding agents in the loop, through a spec-then-implement, reviewed workflow. The architecture and engineering judgment are mine in both; the tooling simply changed.
01 — Philosophy

Engineering Principles

Patterns that recur across both platforms and my own projects — derived from production code, not hypotheticals.

Data Contracts First

Every boundary is a typed, validated schema — inputs, outputs, service responses, error payloads. The schema is the API contract, and the source of truth lives in one shared place.

Explicit Over Clever

Long descriptive names, full type hints, no magic globals. Configuration centralized per service and loaded from the environment with typed defaults.
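
A minimal sketch of what centralized, environment-loaded configuration with typed defaults can look like. The class name, field names, and environment variable names are illustrative assumptions, not taken from the platform.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical per-service settings; every field name is illustrative."""
    service_name: str
    request_timeout_seconds: float
    max_concurrent_requests: int
    tracing_enabled: bool


def _read_bool(raw: str) -> bool:
    # Accept the common truthy spellings; everything else is False.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_config(env=None) -> ServiceConfig:
    """Build the config from environment variables, falling back to typed defaults."""
    env = os.environ if env is None else env
    return ServiceConfig(
        service_name=env.get("SERVICE_NAME", "example-service"),
        request_timeout_seconds=float(env.get("REQUEST_TIMEOUT_SECONDS", "30")),
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "8")),
        tracing_enabled=_read_bool(env.get("TRACING_ENABLED", "false")),
    )
```

Passing a plain dict as `env` keeps the loader easy to test; a malformed number raises at startup rather than at first use.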

Graceful Degradation

Optional dependencies, mock tracers when telemetry isn't configured, circuit breakers and failover. Services that fail return structured errors, not stack traces.
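
One way to picture the mock-tracer fallback and structured errors, as a sketch under assumptions: `NoOpTracer`, `get_tracer`, and `handle_request` are hypothetical names, and the method names `start_as_current_span` and `set_attribute` are chosen to mirror OpenTelemetry's tracer and span interface so a real tracer could be swapped in.

```python
from contextlib import contextmanager


class NoOpSpan:
    """Span stand-in that accepts attributes and discards them."""

    def set_attribute(self, key, value):
        pass


class NoOpTracer:
    """Tracer stand-in used when telemetry is not configured."""

    @contextmanager
    def start_as_current_span(self, name):
        yield NoOpSpan()


def get_tracer(telemetry_endpoint, tracer_factory=None):
    # tracer_factory is a hypothetical hook that would build the real tracer.
    if telemetry_endpoint and tracer_factory is not None:
        return tracer_factory(telemetry_endpoint)
    return NoOpTracer()


def handle_request(tracer, payload):
    """Return a structured result or a structured error, never a raw traceback."""
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("payload.size", len(payload))
        try:
            return {"status": "ok", "result": payload["value"] * 2}
        except KeyError as exc:
            return {
                "status": "error",
                "code": "MISSING_FIELD",
                "detail": f"missing field: {exc.args[0]}",
            }
```

The calling code is identical whether or not telemetry is configured, which is the point of the fallback.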

Production From Day One

Health checks, readiness probes, distributed tracing, structured logging, request-ID propagation, and multi-environment configuration — built into every service from the first commit.

2023 – 2025 · ML Systems · Bodygram

ML Inference Platform — Bodygram

As Senior Software Engineer at Bodygram, I built a production ML inference platform — a shared Python library, 7+ async microservices, infrastructure-as-code for two clouds, CI/CD, and Helm charts — decomposed from a monolith and migrated AWS→GCP as the sole infrastructure engineer. It cut infrastructure costs ~70% and lifted model inference ~8× (async → real-time). The condensed highlights are below.

02 — ML Platform · Architecture

From Monolith to Microservices

The core achievement: decomposing a 4,000+ line monolithic ML pipeline into independently deployable, async microservices orchestrated by a central broker — without taking production down.

Client Request: images + metadata
▼
Async Orchestrator: FastAPI broker • routes to downstream services • manages session
▼ parallel (no data dependency between these)
  • Service A: feature extraction
  • Service B: input validation
  • Service C: image preprocessing
▼ sequential (depends on outputs from above)
  • Service D: needs A + C outputs
    N models in parallel
  • Service E: ensemble inference
▼
Aggregated Response: structured output from all stages

Before: Monolithic Pipeline

Aspect        Legacy
Architecture  Sequential function chain
I/O           Synchronous, blocking
Scaling       Vertical only
Errors        Boolean returns + logs
Deploy        Single container

After: Microservices Platform

Aspect        Refactored
Architecture  Async microservices
I/O           async/await, concurrent
Scaling       Horizontal (HPA per service)
Errors        Typed exceptions + status codes
Deploy        7+ independent containers
03 — ML Platform · Contracts & Orchestration

The Contract Layer & Async Pipeline

A purpose-built shared library enforces consistency across every service — models, exceptions, logging, observability. The orchestrator then coordinates them with async HTTP and dependency-aware parallelism.

Progressive Data Contracts

Base models define core fields; subclasses progressively enrich the schema per pipeline stage — from a minimal set to full ground-truth.

Python — Synthetic Example
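from pydantic import BaseModel, PositiveFloat, PositiveInt


class CoreOutput(BaseModel):
    # Base contract: the minimal field set
    field_alpha: PositiveInt | PositiveFloat
    field_beta: PositiveInt | PositiveFloat
    # ~7 essential fields


class PlatformOutput(CoreOutput):
    # Inherits every core field and adds stage-specific ones
    field_epsilon: PositiveInt | PositiveFloat
    # ~28 fields total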

Generic Service Adapter

One function handles all downstream calls — FormData construction, response validation against a model, status-code checks, and domain-specific exceptions.
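
A rough sketch of the adapter shape, under stated assumptions: it uses a standard-library dataclass in place of the Pydantic model and an injected async `post` callable in place of the HTTP client, so it runs without third-party packages. `call_service`, `DownstreamError`, `ContractError`, `FeatureResponse`, and `fake_post` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass, fields


class DownstreamError(Exception):
    """Hypothetical domain exception for a failed downstream call."""

    def __init__(self, service, status, detail):
        super().__init__(f"{service} failed with status {status}: {detail}")
        self.service = service
        self.status = status


class ContractError(DownstreamError):
    """Raised when a 2xx response does not match the expected schema."""


@dataclass
class FeatureResponse:
    features: list
    model_version: str


async def call_service(post, service, form_data, response_model):
    """Send form data, check the status code, and validate the body."""
    status, body = await post(service, form_data)
    if not 200 <= status < 300:
        detail = body.get("error", "unknown error") if isinstance(body, dict) else str(body)
        raise DownstreamError(service, status, detail)
    expected = {f.name for f in fields(response_model)}
    if not isinstance(body, dict) or set(body) != expected:
        raise ContractError(service, status, f"expected fields {sorted(expected)}")
    return response_model(**body)


async def fake_post(service, form_data):
    """Stand-in transport for the demo; a real one would wrap an HTTP client."""
    if service == "service-a":
        return 200, {"features": [0.1, 0.2], "model_version": "v1"}
    return 503, {"error": "unavailable"}
```

Because the response model is a parameter, every downstream call shares the same status and schema checks, and callers only ever see typed results or domain exceptions.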

Concurrent Orchestration

Independent stages run in parallel; dependent stages await their inputs. Wall-clock time is minimized through careful dependency analysis.

Python — Synthetic Example
import asyncio

import aiohttp


async def run_pipeline(prepared, auth):
    # Hypothetical wrapper; the call_service_* coroutines are defined elsewhere
    async with aiohttp.ClientSession(auth=auth) as session:
        # Layer 1: parallel (no data deps)
        svc_a, svc_b = await asyncio.gather(
            call_service_a(session, prepared),
            call_service_b(session, prepared))

        if svc_b.passed:
            # Layer 2: sequential (needs A output)
            svc_d = await call_service_d(
                session, svc_a.features)
            svc_e = await call_service_e(...)

Every service follows the same three-layer shape (API / service / config) with standardized /healthz & /readyz probes, distributed tracing, and request-ID middleware.
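
As an illustration of request-ID propagation and a health probe, here is a framework-free sketch. The `x-request-id` header name, the `handle` wrapper, and the `healthz` handler body are assumptions for the example; in a real service this logic would live in the web framework's middleware.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID for the duration of that request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def resolve_request_id(headers):
    """Reuse the caller's ID if present, otherwise mint a new one."""
    return headers.get("x-request-id") or uuid.uuid4().hex


def handle(headers, handler):
    """Middleware-style wrapper: set the ID, run the handler, echo the ID back."""
    token = request_id_var.set(resolve_request_id(headers))
    try:
        body = handler()
        return {"headers": {"x-request-id": request_id_var.get()}, "body": body}
    finally:
        request_id_var.reset(token)


def healthz():
    # Liveness probe: answers as long as the process is up.
    return {"status": "ok"}
```

Using a `ContextVar` keeps the ID correct under asyncio, where several requests interleave on one thread.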

04 — ML Platform · Infrastructure

Cloud Architecture & AWS→GCP Migration

Migrated the platform from AWS CDK (TypeScript) to OpenTofu on GCP — designing GKE clusters with GPU time-sharing, multi-environment isolation, and cost-optimized spot instances — as the sole infrastructure engineer.

Terraform Module Structure

HCL — Synthetic Example
infra/
├─ environments/
│  ├─ dev/    # spot GPUs, scale-to-zero
│  └─ prod/   # reserved instances
├─ modules/gcp/
│  ├─ kubernetes/cluster/
│  ├─ kubernetes/node_pool/
│  └─ iam/ alerts/ storage/
└─ modules/aws/  # OIDC federation

Cost & Reliability

  • GKE with Workload Identity; Cloud Storage FUSE for model access
  • Virtual GPU time-sharing for concurrent model serving
  • Spot instances + CronJob scaling (down off-hours, up during business hours)
  • Scale-from-zero in dev; reserved, SLA-backed nodes in prod
  • Managed Prometheus + DCGM GPU metrics, multi-tier priority classes
A 14-month migration executed solo — one other engineer initialized the original AWS CDK pipeline and left mid-2024; everything after was one person learning Terraform, designing GKE clusters, and shipping to production.
2025 – 2026 · Platform / Infrastructure

Enterprise LLM Platform — Japan AI

Platform / infrastructure & DevOps owner at Japan AI (a generative-AI company in the publicly listed GENIEE group, building products such as JAPAN AI CHAT, AGENT, and SPEECH). ~700 commits over 8 months across the cloud substrate of a multi-tenant LLM platform. Condensed highlights below; details are generalized.

05 — LLM Platform · Infrastructure as Code

Reusable Terraform Module Library

Built the GCP infrastructure-as-code foundation as a library of reusable, validated Terraform modules — refactoring copy-pasted, per-service configuration into shared modules for IAM, service accounts, Artifact Registry, Cloud Build triggers, and alert policies.

Reusable, Validated Modules

One module definition reused across many services — with input validation and bounded provider versions, so misconfiguration fails fast at plan time instead of in production.

HCL — Synthetic Example
module "service_account" {
  source            = "../modules/iam/service_account"
  account_id        = var.name
  roles             = var.roles
  workload_identity = true   # GKE KSA binding
}

variable "name" {
  type = string
  validation {
    condition     = length(var.name) <= 30
    error_message = "SA id must be 30 chars or fewer."
  }
}

Refactoring & Deploy Pipelines

  • Replaced copy-pasted, per-service Terraform with one shared module set
  • Validation blocks + bounded provider versions — fail fast, not in prod
  • Workload Identity scoping and Secret Manager → Kubernetes wiring
  • Refactored build triggers into separate Cloud Run and GKE deploy pipelines
06 — LLM Platform · Kubernetes

The Platform Helm Chart

A production Helm chart — semver, changelog, named maintainer — that standardizes moving serverless (Cloud Run) services onto GKE, gated by a full unit-test suite.

Layered Values + Volume Abstraction

Deep-merge of common and per-environment values, and a single-syntax volume field that auto-selects the CSI driver — plain name → PVC, gcs:// → GCS Fuse, nfs:// → NFS — with Workload Identity and an optional Cloud SQL Proxy sidecar.
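
The chart does this in Helm templates; the Python below only sketches the two rules described above (values deep-merge and scheme-based volume selection). The function names and the shape of the returned dictionaries are assumptions for illustration.

```python
def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_volume(source):
    """Pick a volume backend from a single source string."""
    if source.startswith("gcs://"):
        return {"kind": "gcs-fuse", "bucket": source[len("gcs://"):]}
    if source.startswith("nfs://"):
        server, _, path = source[len("nfs://"):].partition("/")
        return {"kind": "nfs", "server": server, "path": "/" + path}
    # No scheme: treat the value as the name of an existing PVC.
    return {"kind": "pvc", "claim": source}
```

For example, merging `{"resources": {"cpu": "500m", "memory": "512Mi"}}` with a prod override of `{"resources": {"memory": "2Gi"}}` keeps the CPU setting and replaces only the memory value.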

Backed by a Full Test Suite

helm unittest across deployment, resources, RBAC, secrets, managed DB, and every volume type — run by a GitHub Actions pipeline (lint / unit / render) and a pre-push git hook.

Event-Driven Autoscaling (KEDA)

Queue-backed workers scale on queue depth, not CPU. The chart auto-creates the KEDA TriggerAuthentication from the deployment's own secret — eliminating a class of silent scaling failures.

YAML — Synthetic Example
kind: ScaledObject
spec:
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata: { queueName: tasks,
                  queueLength: "30" }
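
The scaling rule this trigger implies can be approximated in a few lines. This is a simplified model, assuming the usual average-value calculation (queue depth divided by the per-replica target, rounded up, then clamped); the real decision is made by KEDA and the Kubernetes autoscaler, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(queue_depth, target_per_replica=30, min_replicas=1, max_replicas=10):
    """Approximate replica count for a queue-depth trigger, clamped to the bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With the values in the example above, a backlog of 95 messages asks for 4 workers, while an empty queue stays at the configured minimum of 1.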

Reliability defaults: min 2 replicas + PodDisruptionBudgets in prod, Gateway API HTTPRoute, secrets via CSI/envFrom.

Self-hosted agent tooling. The same platform runs Model Context Protocol (MCP) servers in-house on GKE — a custom operator turns a declarative resource into a per-tenant, scale-to-zero Knative service with Istio ingress, cert-manager TLS, and Workload Identity.
07 — LLM Platform · Observability & SRE

On-Call That Earns Its Keep — and an Alert-Triage Agent

A monitoring baseline (Prometheus / Grafana / Loki / Mimir), an alert-routing service, and an AI agent that triages alerts before a human is paged.

Data-Driven Alerting + On-Call

  • Per-service thresholds from real multi-day profiles, with headroom above p99 — saturated-by-design services don't page
  • Dead-man's-switch + meta-monitoring: the alerting pipeline watches itself
  • The alert-routing service routes by severity, runs a timezone-aware on-call rotation with PTO overrides, and threads repeat alerts instead of flooding
  • Ticket-aware silencing with a TTL, using a Redis single-winner check for multi-pod safety
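
The single-winner check in the last bullet can be sketched as an atomic "set if absent, with expiry". The in-memory store below is a test double that mirrors the `nx` and `ex` arguments of redis-py's `set`; the key format and the `try_claim_silence` helper are assumptions, not the service's actual code.

```python
import time


class InMemoryStore:
    """Test double mirroring the subset of redis-py's `set` used below."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        current = self._data.get(name)
        # Drop the entry if its TTL has elapsed.
        if current is not None and current[1] is not None and current[1] <= now:
            del self._data[name]
            current = None
        if nx and current is not None:
            return None  # key already held by another caller
        expires_at = now + ex if ex is not None else None
        self._data[name] = (value, expires_at)
        return True


def try_claim_silence(store, alert_fingerprint, ticket_id, ttl_seconds):
    """Return True only for the one caller that creates the silence."""
    key = f"silence:{alert_fingerprint}"
    return bool(store.set(key, ticket_id, nx=True, ex=ttl_seconds))
```

When several pods receive the same alert, only the pod whose call returns True posts the silence; the rest skip it, and the TTL lets the alert fire again once the silence expires.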

AI Agent for Alert Triage

Autonomously scans priority services and receives repeat-alert webhooks, then runs a bounded multi-round LLM investigation over read-only log probes, ending in a structured verdict and escalation decision.
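
A bounded investigation loop of this kind might look like the sketch below. The reply protocol (a JSON object that is either a `probe` request or a verdict), the `ask_model` and `run_probe` callables, and the fallback verdict are all assumptions made for the example; the escalation levels are the four named in this section.

```python
import json

ESCALATIONS = {"POSTMORTEM", "INVESTIGATE", "MONITOR", "LOG_ONLY"}
# Assumed fallback when the model never produces a valid verdict.
FALLBACK = {"escalation": "INVESTIGATE", "summary": "no valid verdict produced"}


def validate_verdict(reply):
    """Return a clean verdict dict, or None if the reply is not a valid verdict."""
    if reply.get("escalation") not in ESCALATIONS:
        return None
    if not isinstance(reply.get("summary"), str):
        return None
    return {"escalation": reply["escalation"], "summary": reply["summary"]}


def investigate(alert, ask_model, run_probe, max_rounds=3):
    """Run at most max_rounds model turns, then fall back to a safe verdict."""
    evidence = []
    for _ in range(max_rounds):
        try:
            reply = json.loads(ask_model(alert, evidence))
        except json.JSONDecodeError:
            continue  # malformed output still consumes a round
        if not isinstance(reply, dict):
            continue
        if isinstance(reply.get("probe"), str):
            # Read-only log probe; its result feeds the next round.
            evidence.append(run_probe(reply["probe"]))
            continue
        verdict = validate_verdict(reply)
        if verdict is not None:
            return verdict
    return dict(FALLBACK)
```

The hard round limit bounds both cost and latency, and validating the verdict means downstream routing never acts on free-form model text.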

  • 3-tier failover data layer behind one uniform interface
  • Circuit breaker on the LLM chokepoint; bounded executor with backpressure
  • Logs treated as untrusted input — sanitized, delimited, JSON-validated output
  • Escalation: POSTMORTEM / INVESTIGATE / MONITOR / LOG_ONLY → Slack
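
For readers unfamiliar with the pattern, a circuit breaker around a single chokepoint can be as small as the sketch below. It is a generic illustration, not the agent's implementation; the thresholds, the class names, and the injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of queueing behind a slow or failing LLM endpoint.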
2024 – 2026 · Open Source & Products

Built & Shipped

On my own time I build LLM-inference infrastructure and ship it as real software — an open-source Rust inference engine, a distributed mesh product, a model-serving library, and a co-founded marketplace. The same spec-driven, agent-assisted process as the work above.

08 — Open Source & Products

A Local-LLM Inference Line, and a Marketplace

Three of these orbit the same problem — running LLMs locally — scaling from a single-node engine to a whole-network mesh. The fourth shows the full-stack and product range behind the infrastructure work.

Public · Apache-2.0

spindll — Rust LLM inference server

A single Rust binary that replaces Ollama — serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon through a hand-built Swift/C-ABI bridge. Four API surfaces (gRPC, HTTP/SSE, OpenAI-compatible, embeddable crate), a two-tier compressed + quantized KV cache, and continuous batching.

Rust · Swift / MLX · llama.cpp · gRPC
github.com/Iito/spindll →
Product · lmparley.com

parley — distributed inference mesh

Pools every machine on a LAN into one on-premise cluster behind drop-in OpenAI/Ollama APIs — fair scheduling, peer-to-peer model transfer, and encryption by default, all from a single binary. Powered by spindll, so it serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon.

Rust · Swift / MLX · llama.cpp · OpenAI / Ollama API
lmparley.com →
Public · MIT · PyPI

fastmodel — model-serving framework

Turns any typed __call__ class into a FastAPI service by reflecting over its Pydantic types — convention-over-configuration model serving, with hand-rolled conventional-commit → semver → PyPI release automation and CI benchmarks. An independent project (not part of spindll).
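
The reflection idea can be shown without any web framework. The sketch below is not fastmodel's API; `SentimentModel` and `describe_endpoint` are hypothetical, and the output is a plain dictionary standing in for the route and schema a serving layer would generate.

```python
import inspect
from typing import get_type_hints


class SentimentModel:
    """Hypothetical model class with a fully typed __call__."""

    def __call__(self, text: str, threshold: float = 0.5) -> dict:
        score = min(1.0, len(text) / 100)
        return {"positive": score >= threshold, "score": score}


def describe_endpoint(model):
    """Derive an endpoint description from the type hints on __call__."""
    call = type(model).__call__
    hints = get_type_hints(call)
    signature = inspect.signature(call)
    params = {}
    for name, param in signature.parameters.items():
        if name == "self":
            continue
        params[name] = {
            "type": hints.get(name, object).__name__,
            # A parameter without a default must be supplied by the client.
            "required": param.default is inspect.Parameter.empty,
        }
    return {
        "path": "/" + type(model).__name__.lower(),
        "request": params,
        "response": hints.get("return", object).__name__,
    }
```

Because the request and response shapes come from the annotations, the class author writes no routing or schema code; that is the convention-over-configuration trade.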

Python · FastAPI · Pydantic
pypi.org/project/fastmodel →
Co-founded · pre-launch

Frifty — thrift marketplace

A swipe-based second-hand marketplace, built full-stack solo: a FastAPI/Postgres backend, an Astro + React web app, a React Native mobile app, and a self-hosted CLIP moderation service on GCP. Product and full-stack range alongside the infrastructure work.

FastAPI · React Native · Astro · GCP · CLIP
frifty.io →
One through-line. spindll, parley, and fastmodel all orbit local LLM/model serving: spindll powers parley, while fastmodel is a separate library (it serves the marketplace's moderation model). Across all of them: polyglot systems work (Rust · Swift · Python, with C/C++ and Swift FFI down to Metal), cross-platform signed binaries, and the same agent-harness, spec-driven engineering process used at work.
09 — Experience & Education

Career

Tokyo-based, 2016–present. Full history on LinkedIn.

Oct 2025 – Present

Senior Platform Engineer — JAPAN AI (GENIEE group)

Shinjuku, Tokyo · Building the GCP/GKE platform foundations, leading workload migration to Kubernetes, and standardizing Terraform, Helm, CI/CD, and production-infrastructure patterns.

Jun 2024 – Present

MLOps / Model-Serving Engineer — Independent

Tokyo, Japan · Advisory and hands-on AI-pipeline migration, async/real-time serving design, and Linux troubleshooting for batch + REST AI systems.

Jan 2023 – Sep 2025

Senior Software Engineer — Bodygram

Minato, Tokyo · Architected multi-cloud CI/CD (Cloud Build + OpenTofu) for 12+ AI services and ran the full model lifecycle on GKE with multi-environment Helm and CPU/GPU scheduling. Cut infrastructure costs ~70% (monolith → microservices) and lifted model inference ~8× (async → real-time).

May 2018 – Dec 2022

Data Scientist — Rakuten Institute of Technology

Setagaya, Tokyo · Productionized ML/NLP models (CRF, BERT, PyTorch, TensorFlow) with CPU/GPU parallelism and GKE cluster management; system architecture and research-engineering support.

Nov 2017 – Feb 2018

Customer Service Engineer — GreyOrange

Osaka, Japan · Supported the Butler warehouse-robotics system at client sites across Japan.

Jan 2016 – Oct 2017

Repair Technician — SoftBank Group International

Tokyo, Japan · Diagnosed and repaired Nao/Pepper robots — root-cause analysis, SOPs, and tooling in C/Python/Shell on Linux.

Education

  • DUT — Electrical Engineering & Industrial Computing, IUT, Université de Rouen · 2013–2015
  • Bachelor's — Maths / IT / Electrical & Electronic Engineering & Automation, Université de Rouen · 2012–2013
  • Electronic Engineering Certificate — LTP La Châtaigneraie · 2009–2012

Languages

  • French — native / bilingual
  • English — full professional
  • Japanese — elementary
10 — Summary

What This Body of Work Demonstrates

Systems Thinking

Shared libraries and generic modules as the single source of truth — consistency across many independently deployed services.

Refactoring at Scale

A monolith decomposed into microservices, and sprawling copy-pasted Terraform refactored into a reusable module library — both with the production pipeline running throughout.

Cloud & Kubernetes

IaC across AWS and GCP, GKE cluster design, GPU-aware and event-driven autoscaling, and cost optimization through spot/scheduled scaling.

SRE Discipline

Distributed tracing, structured logging, data-driven alerting, on-call automation, and an LLM triage agent that cuts manual toil.

End-to-End Ownership

From shared library to service to infrastructure to CI/CD to Helm charts — one coherent engineering mind across the whole stack.

Builds & Ships Products

Beyond infrastructure — open-source LLM tooling (a Rust engine, a mesh product, a serving library) and a co-founded marketplace: polyglot systems, native MLX, cross-platform binaries, and open-core thinking.

On methodology. The earlier platform was hand-authored with editor tab-completion; the recent work was produced with modern coding agents in a spec-then-implement, reviewed workflow. The architecture, decomposition, trade-offs, and operational judgment are mine throughout — the tooling is leverage on top of that.