Mansa Camara — Platform & ML Infrastructure Engineer

00 — Overview

Two Platforms — and What I Ship on My Own

The same engineer and the same discipline across three threads — two production platforms built at work, and the open-source LLM tooling and marketplace I build on the side.

◈

ML Inference Platform — Bodygram

As Senior Software Engineer at Bodygram, decomposed a 4,000+ line monolithic ML pipeline into async microservices bound by a shared contract library and migrated the platform from AWS to GCP — cutting infrastructure costs ~70% and lifting model inference ~8×.

2023 – 2025 · Bodygram · ML systems

◆

Enterprise LLM Platform

Owned the cloud substrate of a multi-tenant LLM platform — reusable Terraform modules, a serverless→Kubernetes migration platform, autoscaling, observability, an alert-routing service, and an AI agent for alert triage.

2025 – 2026 · platform / infra · AI-augmented

▲

Open Source & Products

Builds and ships LLM-inference infrastructure — a Rust inference engine, a distributed mesh product, a model-serving library, and a co-founded marketplace.

2024 – 2026 · Rust · Swift · Python · open source + product

01 — Philosophy

Engineering Principles

Patterns that recur across both platforms and my own projects — derived from production code, not hypotheticals.

◈

Data Contracts First

Every boundary is a typed, validated schema — inputs, outputs, service responses, error payloads. The schema is the API contract, and the source of truth lives in one shared place.

◆

Explicit Over Clever

Long descriptive names, full type hints, no magic globals. Configuration centralized per service and loaded from the environment with typed defaults.
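
A minimal sketch of what "loaded from the environment with typed defaults" can look like, using only the standard library. The class name, field names, and environment variable names are hypothetical, chosen for illustration rather than taken from either platform.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Typed settings for one service; every field has an explicit default."""
    service_name: str = "example-service"
    request_timeout_seconds: float = 10.0
    max_concurrent_requests: int = 8
    tracing_enabled: bool = False


def _parse_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_config_from_environment(environ=None) -> ServiceConfig:
    """Build the config from environment variables, falling back to typed defaults."""
    env = os.environ if environ is None else environ
    defaults = ServiceConfig()
    return ServiceConfig(
        service_name=env.get("SERVICE_NAME", defaults.service_name),
        request_timeout_seconds=float(
            env.get("REQUEST_TIMEOUT_SECONDS", defaults.request_timeout_seconds)),
        max_concurrent_requests=int(
            env.get("MAX_CONCURRENT_REQUESTS", defaults.max_concurrent_requests)),
        tracing_enabled=_parse_bool(
            env.get("TRACING_ENABLED", str(defaults.tracing_enabled))),
    )
```

Passing a plain dict as `environ` keeps the loader testable without touching the real process environment; a malformed value (for example a non-numeric timeout) raises at startup instead of at first use.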

▶

Graceful Degradation

Optional dependencies, mock tracers when telemetry isn't configured, circuit breakers and failover. Services that fail return structured errors, not stack traces.
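
One way to sketch the mock-tracer and structured-error ideas, assuming a tracer interface with a single `start_span` context manager. `NoOpTracer`, `build_tracer`, `ErrorPayload`, and the `real_tracer_factory` parameter are illustrative names, not the production API.

```python
import contextlib
from dataclasses import asdict, dataclass


class NoOpTracer:
    """Stand-in used when no telemetry backend is configured."""

    @contextlib.contextmanager
    def start_span(self, name: str):
        # Callers can still write `with tracer.start_span(...)` unchanged.
        yield None


def build_tracer(telemetry_endpoint, real_tracer_factory=None):
    """Return a real tracer only when telemetry is configured; otherwise a no-op."""
    if not telemetry_endpoint or real_tracer_factory is None:
        return NoOpTracer()
    return real_tracer_factory(telemetry_endpoint)


@dataclass(frozen=True)
class ErrorPayload:
    """Structured error returned to callers instead of a stack trace."""
    error_code: str
    message: str
    request_id: str


def run_with_structured_errors(handler, request_id: str, tracer) -> dict:
    """Run a handler; on failure return a structured payload, never a traceback."""
    with tracer.start_span("handle_request"):
        try:
            return {"ok": True, "result": handler()}
        except Exception as exc:
            payload = ErrorPayload(type(exc).__name__, str(exc), request_id)
            return {"ok": False, "error": asdict(payload)}
```

The point of the no-op object is that service code never branches on "is telemetry available"; the decision is made once, at construction time.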

★

Production From Day One

Health checks, readiness probes, distributed tracing, structured logging, request-ID propagation, and multi-environment configuration — built into every service from the first commit.
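
Request-ID propagation and structured logging can be sketched with the standard library alone. This is an assumption-laden illustration: the logger name, the JSON field names, and `handle_request` are hypothetical, and a real service would set the ID in HTTP middleware rather than in a plain function.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the ID of the request currently being handled in this context.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": record.request_id,
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines carrying the request ID."""
    logger = logging.getLogger("example-service")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_request_id=None) -> str:
    """Reuse the caller's request ID if present, otherwise mint one."""
    request_id = incoming_request_id or uuid.uuid4().hex
    token = request_id_var.set(request_id)
    try:
        logger.info("request received")
        return request_id
    finally:
        request_id_var.reset(token)
```

Because `contextvars` is task-local under asyncio, concurrent requests on one event loop keep separate IDs without passing the value through every function signature.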

02 — ML Platform · Architecture

From Monolith to Microservices

The core achievement: decomposing a 4,000+ line monolithic ML pipeline into independently deployable, async microservices orchestrated by a central broker — without taking production down.

Client Request: images + metadata
▼
Async Orchestrator: FastAPI broker • routes to downstream services • manages session
▼ parallel (no data dependency between these)
Service A: feature extraction
Service B: input validation
Service C: image preprocessing
▼ sequential (depends on outputs from above)
Service D: needs A + C outputs; N models in parallel
Service E: ensemble inference
▼
Aggregated Response: structured output from all stages

Before: Monolithic Pipeline

Aspect	Legacy
Architecture	Sequential function chain
I/O	Synchronous, blocking
Scaling	Vertical only
Errors	Boolean returns + logs
Deploy	Single container

After: Microservices Platform

Aspect	Refactored
Architecture	Async microservices
I/O	async/await, concurrent
Scaling	Horizontal (HPA per service)
Errors	Typed exceptions + status codes
Deploy	7+ independent containers

03 — ML Platform · Contracts & Orchestration

The Contract Layer & Async Pipeline

A purpose-built shared library enforces consistency across every service — models, exceptions, logging, observability. The orchestrator then coordinates them with async HTTP and dependency-aware parallelism.

Progressive Data Contracts

Base models define core fields; subclasses progressively enrich the schema per pipeline stage — from a minimal set to full ground-truth.

Python — Synthetic Example
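from pydantic import BaseModel, PositiveFloat, PositiveInt

class CoreOutput(BaseModel):
    # Minimal contract shared by every pipeline stage
    field_alpha: PositiveInt | PositiveFloat
    field_beta:  PositiveInt | PositiveFloat
    # ~7 essential fields

class PlatformOutput(CoreOutput):
    # Inherits the core fields and adds stage-specific ones
    field_epsilon: PositiveInt | PositiveFloat
    # ~28 fields total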

Generic Service Adapter

One function handles all downstream calls — FormData construction, response validation against a model, status-code checks, and domain-specific exceptions.
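
A rough sketch of the adapter shape described above, written against the standard library so it runs without a network. The transport is injected as `send_form`, and dataclasses stand in for the validated response models. Every name here (`call_downstream_service`, `DownstreamServiceError`, `ContractViolationError`, `FeatureResponse`) is hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Callable, Type, TypeVar

ResponseT = TypeVar("ResponseT")


class DownstreamServiceError(Exception):
    """Downstream service returned a non-2xx status."""

    def __init__(self, service_name: str, status_code: int):
        super().__init__(f"{service_name} returned HTTP {status_code}")
        self.service_name = service_name
        self.status_code = status_code


class ContractViolationError(Exception):
    """Response body did not match the expected model."""


def call_downstream_service(
    service_name: str,
    form_fields: dict,
    response_model: Type[ResponseT],
    send_form: Callable[[str, dict], tuple],
) -> ResponseT:
    """Send form data, check the status code, and validate the body against a model.

    `send_form` is an injected transport returning (status_code, body_dict).
    """
    status_code, body = send_form(service_name, form_fields)
    if not 200 <= status_code < 300:
        raise DownstreamServiceError(service_name, status_code)
    expected = {f.name for f in fields(response_model)}
    if set(body) != expected:
        raise ContractViolationError(
            f"{service_name}: expected fields {sorted(expected)}, got {sorted(body)}")
    return response_model(**body)


@dataclass(frozen=True)
class FeatureResponse:  # hypothetical response contract
    features: list
    model_version: str
```

Keeping the status check, the validation, and the exception mapping in one function means each new downstream service only contributes a response model, not another copy of the error handling.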

Concurrent Orchestration

Independent stages run in parallel; dependent stages await their inputs. Wall-clock time is minimized through careful dependency analysis.

Python — Synthetic Example
import asyncio
import aiohttp

# run_pipeline is an illustrative wrapper; the call_service_* helpers live elsewhere
async def run_pipeline(prepared, auth):
    async with aiohttp.ClientSession(auth=auth) as session:
        # Layer 1: parallel (no data deps)
        svc_a, svc_b = await asyncio.gather(
            call_service_a(session, prepared),
            call_service_b(session, prepared),
        )

        if svc_b.passed:
            # Layer 2: sequential (needs A output)
            svc_d = await call_service_d(session, svc_a.features)
            svc_e = await call_service_e(...)

Every service follows the same three-layer shape (API / service / config) with standardized /healthz & /readyz probes, distributed tracing, and request-ID middleware.

04 — ML Platform · Infrastructure

Cloud Architecture & AWS→GCP Migration

Migrated the platform from AWS CDK (TypeScript) to OpenTofu on GCP — designing GKE clusters with GPU time-sharing, multi-environment isolation, and cost-optimized spot instances — as the sole infrastructure engineer.

Terraform Module Structure

HCL — Synthetic Example
infra/
├─ environments/
│  ├─ dev/    # spot GPUs, scale-to-zero
│  └─ prod/   # reserved instances
├─ modules/gcp/
│  ├─ kubernetes/cluster/
│  ├─ kubernetes/node_pool/
│  └─ iam/ alerts/ storage/
└─ modules/aws/  # OIDC federation

Cost & Reliability

GKE with Workload Identity; Cloud Storage FUSE for model access
Virtual GPU time-sharing for concurrent model serving
Spot instances + CronJob scaling in dev (down off-hours, up during working hours)
Scale-from-zero in dev; reserved, SLA-backed nodes in prod
Managed Prometheus + DCGM GPU metrics, multi-tier priority classes

A 14-month migration executed solo. One other engineer initialized the original AWS CDK pipeline and left mid-2024; everything after was one person learning Terraform, designing GKE clusters, and shipping to production while maintaining the existing AWS infrastructure.

05 — LLM Platform · Infrastructure as Code

Reusable Terraform Module Library

Built the GCP infrastructure-as-code foundation as a library of reusable, validated Terraform modules, refactoring copy-pasted, per-service configuration into shared module templates for IAM, service accounts, Artifact Registry, Cloud Build triggers, and alert policies.

Reusable, Validated Modules

One module definition reused across many services — with input validation and bounded provider versions, so misconfiguration fails fast at plan time instead of in production.

HCL — Synthetic Example
module "service_account" {
  source            = "../modules/iam/service_account"
  account_id        = var.name
  roles             = var.roles
  workload_identity = true   # GKE KSA binding
}

variable "name" {
  type = string
  validation {
    condition     = length(var.name) <= 30
    error_message = "SA id must be 30 chars or fewer."
  }
}

Refactoring & Deploy Pipelines

Replaced copy-pasted, per-service Terraform with one shared module set
Validation blocks + bounded provider versions — fail fast, not in prod
Workload Identity scoping and Secret Manager → Kubernetes wiring
Refactored CI/CD pipelines to allow multi-platform deployment

06 — LLM Platform · Kubernetes

The Platform Helm Chart

A production Helm chart — semver, changelog, named maintainer — that standardizes moving serverless (Cloud Run) services onto GKE, gated by a full unit-test suite.

Layered Values + Volume Abstraction

Deep-merge of common and per-environment values, and a single-syntax volume field that auto-selects the CSI driver (plain name → PVC, gcs:// → GCS Fuse, nfs:// → NFS, and so on), with Workload Identity and opt-in sidecars.
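
The two mechanisms above can be illustrated in Python. The chart itself does this in Helm templates, so treat the following as a sketch of the logic under assumptions: dict values merge recursively while everything else is replaced, and the volume kind is chosen purely by prefix. The function names and the returned dict shape are hypothetical.

```python
import copy


def deep_merge(common: dict, override: dict) -> dict:
    """Recursively merge `override` into a copy of `common`; non-dict values are replaced."""
    merged = copy.deepcopy(common)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_volume(source: str) -> dict:
    """Map the single-syntax volume field to a volume kind by its prefix."""
    if source.startswith("gcs://"):
        return {"kind": "gcs-fuse", "bucket": source[len("gcs://"):]}
    if source.startswith("nfs://"):
        server, _, path = source[len("nfs://"):].partition("/")
        return {"kind": "nfs", "server": server, "path": "/" + path}
    # No recognised scheme: treat the value as a PersistentVolumeClaim name.
    return {"kind": "pvc", "claim_name": source}
```

For example, `resolve_volume("gcs://models-bucket")` selects the GCS Fuse path while `resolve_volume("model-cache")` falls through to a PVC, so chart users write one field and never name a driver.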

Backed by a Full Test Suite

helm unittest across deployment, resources, RBAC, secrets, managed DB, and every volume type — run by a GitHub Actions pipeline (lint / unit / render) and a pre-push git hook.

Event-Driven Autoscaling (KEDA)

Queue-backed workers scale on queue depth, not CPU. The chart auto-creates the KEDA TriggerAuthentication from the deployment's own secret — eliminating a class of silent scaling failures.

YAML — Synthetic Example
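apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  scaleTargetRef:
    name: tasks-worker        # hypothetical Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: tasks
        mode: QueueLength     # scale on queue depth
        value: "30"           # target messages per replica
      authenticationRef:
        name: tasks-worker-auth   # hypothetical; created by the chart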
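
The arithmetic behind queue-depth scaling, as a simplified sketch: the replica count is the queue depth divided by the per-replica target, rounded up and clamped to the configured bounds. The real autoscaler adds stabilization windows and other behaviour not modelled here, and `desired_replicas` is an illustrative name.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each handles about `target_per_replica` queued messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, needed))
```

With a target of 30 and bounds of 1 to 10, a backlog of 95 messages asks for 4 replicas, while an idle queue stays at the floor of 1.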

Reliability defaults: min 2 replicas + PodDisruptionBudgets in prod, Gateway API HTTPRoute, secrets via CSI/envFrom.

Self-hosted agent tooling. The same platform runs Model Context Protocol (MCP) servers in-house on GKE — a custom operator turns a declarative resource into a per-tenant, scale-to-zero Knative service with Istio ingress, cert-manager TLS, and Workload Identity.

07 — LLM Platform · Observability & SRE

On-Call That Earns Its Keep — and an Alert-Triage Agent

A monitoring baseline (Prometheus / Grafana / Loki / Mimir), an alert-routing service, and an AI agent that triages alerts before a human is paged.

Data-Driven Alerting + On-Call

Per-service thresholds from real multi-day profiles, with headroom above p99 — saturated-by-design services don't page
Dead-man's-switch + meta-monitoring: the alerting pipeline watches itself
The alert-routing service routes by severity, runs a timezone-aware on-call rotation with PTO overrides, and threads repeat alerts instead of flooding
Ticket-aware silencing with a TTL, using a Redis single-winner check for multi-pod safety
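
Two of the ideas in this list lend themselves to a short sketch: deriving a threshold from observed samples (p99 plus headroom) and a single-winner check with a TTL. The 20% headroom default is an arbitrary assumption, and `InMemorySilenceStore` is a hypothetical in-process stand-in; against Redis the same semantics would typically come from a `SET` with the `NX` and `EX` options.

```python
def alert_threshold(samples: list, headroom: float = 0.2) -> float:
    """p99 of observed samples (nearest-rank) plus fractional headroom."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers, 1-indexed.
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return p99 * (1.0 + headroom)


class InMemorySilenceStore:
    """In-process stand-in for an atomic set-if-absent-with-expiry store."""

    def __init__(self, clock):
        self._clock = clock          # injected so tests can control time
        self._expires_at = {}

    def try_acquire(self, key: str, ttl_seconds: float) -> bool:
        """Return True for exactly one caller per key until the TTL lapses."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        self._expires_at[key] = now + ttl_seconds
        return True
```

Only the pod whose `try_acquire` returns True creates the silence; the others see False and skip, which is what makes the check safe when several replicas receive the same alert.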

AI Agent for Alert Triage

Autonomously scans priority services and receives repeat-alert webhooks, then runs a bounded multi-round LLM investigation over read-only log probes, ending in a structured verdict and escalation decision.
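
A bounded investigation loop of this kind might look roughly like the following. It is a sketch under assumptions: `ask_model` and `run_probe` are hypothetical callables, the reply dict shape is invented for illustration, and falling back to INVESTIGATE when the round budget runs out is a choice made here, not a documented behaviour of the agent.

```python
import json

ESCALATION_LEVELS = ("POSTMORTEM", "INVESTIGATE", "MONITOR", "LOG_ONLY")


def parse_verdict(raw_text: str) -> dict:
    """Accept only a JSON object carrying a known escalation level."""
    verdict = json.loads(raw_text)
    if not isinstance(verdict, dict) or verdict.get("escalation") not in ESCALATION_LEVELS:
        raise ValueError("verdict does not match the expected schema")
    return verdict


def investigate(alert: str, ask_model, run_probe, max_rounds: int = 3) -> dict:
    """Alternate model turns and read-only probes, stopping at a verdict or the round cap."""
    evidence = []
    for _ in range(max_rounds):
        reply = ask_model(alert, evidence)  # hypothetical: returns a dict
        if reply.get("action") == "verdict":
            return parse_verdict(reply["body"])
        evidence.append(run_probe(reply["query"]))
    # Round budget exhausted: fall back to a conservative verdict (assumption).
    return {"escalation": "INVESTIGATE", "reason": "round limit reached"}
```

The hard cap on rounds bounds both cost and latency, and validating the verdict before acting on it means a malformed model reply fails loudly rather than silently choosing an escalation path.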

3-tier failover data layer behind one uniform interface
Circuit breaker on the LLM chokepoint; bounded executor with backpressure
Logs treated as untrusted input — sanitized, delimited, JSON-validated output
Escalation: POSTMORTEM / INVESTIGATE / MONITOR / LOG_ONLY → Communication channels
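
The circuit breaker and tiered failover can be sketched as follows. The thresholds, class names, and the `(tier_name, fetch)` tuple shape are assumptions for illustration; the clock is injected so the open and half-open transitions can be exercised without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are short-circuited."""


class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_seconds:
                raise CircuitOpenError("circuit open; skipping call")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = operation()
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        self._consecutive_failures = 0
        return result


def fetch_with_failover(tiers, query):
    """Try each data tier in order; return the first successful result."""
    errors = []
    for tier_name, fetch in tiers:
        try:
            return tier_name, fetch(query)
        except Exception as exc:
            errors.append(f"{tier_name}: {exc}")
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

Because the failure counter is only reset on success, a failed trial call in the half-open state reopens the circuit immediately instead of granting a fresh budget of failures.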

08 — Open Source & Products

A Local-LLM Inference Line, and a Marketplace

Three of these orbit the same problem — running models locally — scaling from a single-node engine to a whole-network mesh. The fourth shows the full-stack and product range behind the infrastructure work.

Public · Apache-2.0

spindll — Rust LLM inference server

A single Rust binary on par with Ollama: it serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon through a hand-built Swift/C-ABI bridge. Four API surfaces (gRPC, HTTP/SSE, OpenAI-compatible, embeddable crate), a two-tier compressed + quantized KV cache, and memory-budget-aware, on-demand model loading.
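
To illustrate the general idea of loading models on demand within a memory budget, here is a small Python sketch using least-recently-used eviction. It is not spindll's implementation (which is written in Rust); the eviction policy, the `ModelPool` name, and the `size_of` / `load_model` callables are all assumptions.

```python
from collections import OrderedDict


class ModelPool:
    """Load models on demand and evict least-recently-used ones to stay within a budget."""

    def __init__(self, memory_budget_bytes: int, size_of, load_model):
        self._budget = memory_budget_bytes
        self._size_of = size_of            # hypothetical: name -> size in bytes
        self._load_model = load_model      # hypothetical: name -> model object
        self._resident = OrderedDict()     # name -> (model, size), oldest first

    def resident_bytes(self) -> int:
        return sum(size for _, size in self._resident.values())

    def get(self, name: str):
        if name in self._resident:
            self._resident.move_to_end(name)   # mark as most recently used
            return self._resident[name][0]
        size = self._size_of(name)
        if size > self._budget:
            raise MemoryError(f"{name} needs {size} bytes; budget is {self._budget}")
        # Free space before loading so the budget is never exceeded.
        while self._resident and self.resident_bytes() + size > self._budget:
            self._resident.popitem(last=False)  # evict least recently used
        model = self._load_model(name)
        self._resident[name] = (model, size)
        return model
```

Checking the size before loading is the budget-aware part: eviction happens first, so the pool does not briefly hold more than the budget while a new model is being brought in.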

Rust · Swift / MLX · llama.cpp · gRPC

github.com/Iito/spindll →

Product · lmparley.com

parley — distributed inference mesh

Pools every machine on a LAN into one on-premise cluster behind drop-in OpenAI/Ollama APIs — fair scheduling, peer-to-peer model transfer, and encryption by default, all from a single binary. Powered by spindll, so it serves GGUF/llama.cpp everywhere and runs MLX natively on Apple Silicon.

Rust · Swift / MLX · llama.cpp · OpenAI / Ollama API

lmparley.com →

Public · MIT · PyPI

fastmodel — model-serving framework

Turns any typed __call__ class into a FastAPI service by reflecting over its Pydantic types — convention-over-configuration model serving, with hand-rolled conventional-commit → semver → PyPI release automation and CI benchmarks. An independent project (not part of spindll).
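
The reflection idea can be shown with the standard library. This sketch is not fastmodel's API: `describe_endpoint`, the returned dict, and the path convention are hypothetical, and it uses plain type hints where the library reflects over Pydantic types.

```python
import inspect
import typing


def describe_endpoint(model_class) -> dict:
    """Derive an endpoint description from the type hints on `__call__`."""
    hints = typing.get_type_hints(model_class.__call__)
    response_type = hints.pop("return", None)
    parameters = [
        name for name in inspect.signature(model_class.__call__).parameters
        if name != "self"
    ]
    missing = [name for name in parameters if name not in hints]
    if missing:
        raise TypeError(f"untyped parameters on __call__: {missing}")
    return {
        "path": "/" + model_class.__name__.lower(),   # hypothetical naming convention
        "request": {name: hints[name] for name in parameters},
        "response": response_type,
    }


class Sentiment:  # hypothetical model class
    def __call__(self, text: str, threshold: float = 0.5) -> bool:
        return len(text) * 0.1 > threshold
```

Here `describe_endpoint(Sentiment)` yields a request schema of `text: str` and `threshold: float` with a `bool` response, which is the information a framework needs to register a route and validate payloads without extra configuration.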

Python · FastAPI · Pydantic

pypi.org/project/fastmodel →

Co-founded · pre-launch

Frifty — thrift marketplace

A swipe-based second-hand marketplace, built full-stack solo: a FastAPI/Postgres backend, an Astro + React web app, a React Native mobile app, and a self-hosted CLIP moderation service on GCP. It shows product and full-stack range alongside the infrastructure work.

FastAPI · React Native · Astro · GCP · CLIP

frifty.io →

One throughline. spindll, parley, and fastmodel all orbit local LLM/model serving: spindll powers parley, while fastmodel is a separate library (it serves the marketplace's moderation model). Across all of them: a Python engineer turned polyglot (Rust · Swift · Python, with Swift FFI down to Metal) thanks to AI tools (Claude, Cursor, Codex), cross-platform signed binaries, and the same agent-harness, spec-driven engineering process used at work.
The era of AI tools is here to stay, and I'm excited to see what we can build together.

09 — Experience & Education

Career

Tokyo-based, 2016–present. Full history on LinkedIn.

Oct 2025 – Present

Senior Platform Engineer — JAPAN AI (GENIEE group)

Shinjuku, Tokyo · Building the GCP/GKE platform foundations, leading workload migration to Kubernetes, and standardizing Terraform, Helm, CI/CD, and production-infrastructure patterns.

Jun 2024 – Present

MLOps / Model-Serving Engineer — Independent

Tokyo, Japan · Advisory and hands-on AI-pipeline migration, async/real-time serving design, and Linux troubleshooting for batch + REST AI systems.

Jan 2023 – Sep 2025

Senior Software Engineer — Bodygram

Minato, Tokyo · Architected multi-cloud CI/CD (Cloud Build + OpenTofu) for 12+ AI services and ran the full model lifecycle on GKE with multi-environment Helm and CPU/GPU scheduling. Cut infrastructure costs ~70% (monolith → microservices) and lifted model inference ~8× (async → real-time).

May 2018 – Dec 2022

Data Scientist — Rakuten Institute of Technology

Setagaya, Tokyo · Productionized ML/NLP models (CRF, BERT, PyTorch, TensorFlow) with CPU/GPU parallelism and GKE cluster management; system architecture and research-engineering support.

Nov 2017 – Feb 2018

Customer Service Engineer — GreyOrange

Osaka, Japan · Supported the Butler warehouse-robotics system at client sites across Japan.

Jan 2016 – Oct 2017

Repair Technician — SoftBank Group International

Tokyo, Japan · Diagnosed and repaired Nao/Pepper robots — root-cause analysis, SOPs, and tooling in C/Python/Shell on Linux.

Education

DUT — Electrical Engineering & Industrial Computing, IUT, Université de Rouen · 2013–2015
Bachelor's — Maths / IT / Electrical & Electronic Engineering & Automation, Université de Rouen · 2012–2013
Electronic Engineering Certificate — LTP La Châtaigneraie · 2009–2012

Languages

French — native / bilingual
English — full professional
Japanese — elementary

10 — Summary

What This Body of Work Demonstrates

■

Systems Thinking

Shared libraries and generic modules as the single source of truth — consistency across many independently deployed services.

◆

Refactoring at Scale

A monolith decomposed into microservices, and sprawling copy-pasted Terraform refactored into a reusable module library — both with the production pipeline running throughout.

▶

Cloud & Kubernetes

IaC across AWS and GCP, GKE cluster design, GPU-aware and event-driven autoscaling, and cost optimization through spot/scheduled scaling.

★

SRE Discipline

Distributed tracing, structured logging, data-driven alerting, on-call automation, and an LLM triage agent that cuts manual toil.

◈

End-to-End Ownership

From shared library to service to infrastructure to CI/CD to Helm charts — one coherent engineering mind across the whole stack.

▲

Builds & Ships Products

Beyond infrastructure — open-source LLM tooling (a Rust engine, a mesh product, a serving library) and a co-founded marketplace: polyglot systems, native MLX, cross-platform binaries, and open-core thinking.

On methodology. The earlier platform was hand-authored with editor tab-completion; the recent work was produced with modern coding agents in a spec-then-implement, reviewed workflow. The architecture, decomposition, trade-offs, and operational judgment are mine throughout — the tooling is leverage on top of that.
