[
  {
    "slug": "2026-06-29-building-an-ai-patient-flow-orchestrator",
    "title": "Building an AI Patient Flow Orchestrator",
    "description": "An AI agent that watches a simulated ward, spots looming bed shortages, and proposes ranked fixes for a human to approve — and what building it taught me.",
    "tags": [
      "opencode",
      "ai-agents",
      "typescript",
      "next-js",
      "architecture",
      "from-scratch"
    ],
    "excerpt": "What it is I wanted to learn how to build a real AI agent — something that watches a changing situation and does useful work over time. So I gave it a genuinely hard problem: keeping a hospital's beds flowing. The result — I called it the Patient Flo",
    "content": "What it is I wanted to learn how to build a real AI agent — something that watches a changing situation and does useful work over time. So I gave it a genuinely hard problem: keeping a hospital's beds flowing. The result — I called it the Patient Flow Orchestrator — watches a simulated ward, predicts where beds will run short over the next few hours, works out what each ready-to-leave patient is stuck on, and proposes a ranked list of fixes for a human to approve. One caveat up front: the hospital is a vehicle, not the point. I chose patient flow because it exercises every part of an agent while carrying zero clinical risk. Everything runs on a simulated ward with synthetic data: no real patients, and no clinical judgement anywhere in the system. Code: github.com/devdaviddr/ai-patient-flow-orchestrator The problem Hospitals run out of beds, and usually not because the building is full of the sickest people. It's because a handful of patients who are ready to go home are stuck. One is waiting on a pharmacy script. One needs hospital transport booked. One needs a care-home placement before they can leave. Every hour they wait, a bed that a new emergency patient could use stays occupied. That waiting time has a name: access block . Clearing it is mostly phone calls and chasing. Logistics, not medicine. A bed-flow coordinator does that chasing all day, holding the whole moving picture in their head and on the phone. Today's software is a bed-board: it shows the state but it can't think about it. It won't tell you that Ward 4B will be two beds short by 4pm, explain why three discharges are stuck, or line up the actions that close the gap. That gap, between showing and thinking , is where an agent earns its keep. I also wanted a problem that was hard in the right ways for learning to build agents . Most agent demos are toys: a single prompt, one tool call, no real consequences. 
Bed flow gives you changing state, multi-step planning, several tools, a human in the loop, and a result you can actually measure. The solution From the coordinator's seat, the agent feels like a sharp colleague who has already read the whole board: [09:15] Agent: Ward 4B will be 2 beds short by 15:30. 3 patients are ready to leave but stuck: • P-042: waiting on a pharmacy script • P-017: waiting on transport • P-091: waiting on a care placement Suggested fixes, most helpful first: 1. Expedite the script for P-042 (frees the surest discharge) 2. Book transport for P-017 (pickup window still open) [Approval card shown to the coordinator] Coordinator: ✓ approve fix 1 → Script expedited. P-042's bed will free up by 15:15. Agent: Gap is down to 1 bed. Approving transport for P-017 would close it. Each suggestion arrives as a card ranked by impact and tagged with the agent's confidence, the bed-time it would free, the patient it helps, and the agent's reasoning. Nothing happens until someone clicks Approve. How it works The whole thing is a TypeScript and Next.js app. It simulates a hospital (one emergency department plus two ten-bed wards), runs an AI agent over it, shows the suggestions as approval cards, and measures whether the agent actually helped. 
flowchart TB subgraph APP[&quot;My code, the app&quot;] UI[&quot;Bed-board · Approval cards · Results panel&quot;] LOOP[&quot;Loop driver&lt;br/&gt;owns the clock · one prompt per tick&quot;] end subgraph OC[&quot;OpenCode, the agent engine (configured, not coded)&quot;] ORCH[&quot;Orchestrator&lt;br/&gt;runs the ReAct loop · proposes fixes&quot;] subgraph SUBS[&quot;Helper agents · read-only&quot;] DIS[&quot;Discharge helper&lt;br/&gt;finds each blocker&quot;] DEM[&quot;Demand helper&lt;br/&gt;forecasts incoming patients&quot;] end GATE[&quot;Permission gate&lt;br/&gt;action tools need approval&quot;] ORCH --&gt;|&quot;delegate&quot;| DIS ORCH --&gt;|&quot;delegate&quot;| DEM ORCH --- GATE end MODELS[&quot;AI model (swappable)&lt;br/&gt;OpenCode Zen · hosted Claude · local Ollama&quot;] subgraph TOOLS[&quot;Tools, the only hospital-aware code&quot;] READ[&quot;Read tools&lt;br/&gt;world_state · forecast_discharges · forecast_demand&quot;] ACT[&quot;Action tools&lt;br/&gt;expedite_script · request_transport · …&quot;] end SIM[&quot;Simulator, the make-believe hospital&lt;br/&gt;holds the state · runs a seedable clock&quot;] UI --&gt; LOOP LOOP --&gt;|&quot;one prompt per tick&quot;| ORCH MODELS -.-&gt;|&quot;thinking&quot;| ORCH ORCH --&gt;|&quot;tool calls&quot;| READ DIS --&gt;|&quot;read&quot;| READ DEM --&gt;|&quot;read&quot;| READ GATE --&gt;|&quot;on approval&quot;| ACT READ --&gt;|&quot;read state&quot;| SIM ACT --&gt;|&quot;apply change&quot;| SIM SIM --&gt;|&quot;new state → next tick&quot;| LOOP GATE -.-&gt;|&quot;approval card&quot;| UI classDef appNode fill:#1e3a5f,stroke:#4a90d9,color:#e8f0fe classDef ocNode fill:#1f3d2b,stroke:#4caf72,color:#e6f4ea classDef toolNode fill:#3d3416,stroke:#d4a72c,color:#fdf3d6 classDef simNode fill:#3d1f1f,stroke:#d96a6a,color:#fde8e8 classDef modelNode fill:#2e2a4d,stroke:#9b8cff,color:#ece8ff class UI,LOOP appNode class ORCH,DIS,DEM,GATE ocNode class READ,ACT toolNode class SIM simNode class MODELS modelNode style APP 
fill:transparent,stroke:#4a90d9,color:#4a90d9 style OC fill:transparent,stroke:#4caf72,color:#4caf72 style SUBS fill:transparent,stroke:#6fce92,color:#6fce92 style TOOLS fill:transparent,stroke:#d4a72c,color:#d4a72c Three colours, three kinds of code, and the boundaries between them are the whole design: Blue is my code: the Next.js app, the loop that owns the clock, the simulator, and the screen the coordinator sees. Green is configured, not coded: OpenCode runs the agent and its two read-only helpers, all of which are Markdown config files rather than loop code. Amber is the tools: the only code that knows it's a hospital at all. Four decisions did most of the work. 1. Lean on a harness instead of writing the loop My first instinct was to write the agent loop by hand: call the model, parse what it wants, run a tool, feed the result back, repeat, then bolt on permissions and logging. I quickly realised that the loop — plus tool-calling, delegation, and an audit trail — is most of the work in any agent project. And none of it has anything to do with hospitals. So I reached for OpenCode, a ready-made harness. A harness is the engine that wraps a language model and turns it into a working agent: it runs the loop, calls the tools, delegates to helpers, enforces permissions, and records the session. OpenCode was built as a coding agent, but its building blocks are general. I configure it and write only the parts that are genuinely mine. The loop it runs is ReAct, short for Reason plus Act. A chatbot answers in one shot. An agent reasons about what it needs, acts by calling a tool, observes the result, then decides what comes next, and loops until it has enough to write a plan. 
flowchart TD START([&quot;Prompt: assess the ward&quot;]) --&gt; REASON[&quot;Reason&lt;br/&gt;what do I need next?&quot;] REASON --&gt; ACT[&quot;Act&lt;br/&gt;call a tool&quot;] ACT --&gt; OBSERVE[&quot;Observe&lt;br/&gt;read the tool's result&quot;] OBSERVE --&gt; DONE{&quot;Got enough&lt;br/&gt;to plan?&quot;} DONE --&gt;|&quot;No&quot;| REASON DONE --&gt;|&quot;Yes&quot;| PLAN([&quot;Write the ranked plan&quot;]) classDef node fill:#1f3d2b,stroke:#4caf72,color:#e6f4ea classDef edge fill:#1e3a5f,stroke:#4a90d9,color:#e8f0fe class REASON,ACT,OBSERVE node class START,PLAN,DONE edge There are really two loops here, one inside the other. OpenCode runs the inner ReAct loop within a single tick. My code runs the outer loop that owns the clock and re-prompts the agent each time the ward changes. Keeping the outer loop mine is what makes the system agentic over time, and it keeps the hospital's clock under my control rather than the model's. 2. Put every hospital detail behind tools The most important decision in the project: only the tools know it's a hospital. An AI tool anywhere near a hospital must never make a clinical call — no acuity, no triage, no diagnosis. The weak way to enforce that is to ask the model nicely in its prompt. The strong way is to make it structurally impossible. The agent calls plain functions like world_state and expedite_script, and every hospital detail lives behind those tools in the simulator. The read tools hand back beds, blockers, and timings, and nothing clinical. The model can't repeat a triage level it was never handed. 3. Own the approval gate in your own code The action tools are marked ask, which is meant to pause for a human. But there's a known OpenCode bug where that pause gets skipped when you drive it through the SDK, which is exactly how I drive it. Leaning on the engine would mean the most important safety property in the system depends on a third-party bug being fixed and staying fixed. So the gate lives in my own code instead. 
The agent never changes anything; it writes a plan and stops. When the coordinator approves a fix, my code applies it directly in the simulator. There is no path from the AI to a bed. There's one more lock: the agent reaches the simulator with its own read-only service token. Even if the model went completely haywire, it simply cannot call an action route. The only way a bed ever changes is a human clicking Approve. 4. Make the safety line a test, not a hope The gate stops the agent from doing clinical things, but not from saying them; it could still write a triage word into its plan text. A prompt instruction like never make a clinical judgement is a hope; it might hold today and break after a model update. Because no tool ever feeds the agent a clinical concept, the agent has nothing clinical to repeat, and a test proves it. On every push, CI scans every plan, answer, and saved record for a long list of triage levels, acuity scores, and diagnostic and treatment words. If any of them ever appears, the build goes red. Safety becomes a property of the system's shape. Does it actually help? A demo proves the agent works once. It doesn't prove it helps in general, or that it isn't quietly making things worse. The only honest test is to run the same day twice, once with the agent's approved fixes and once with no agent at all, then compare. The key move: seed the world, not the model. The simulator is seedable, so the same scenario replays identically. That makes the agent the only difference between the two runs, so any change in the numbers is down to it. (Seeding the model would fake determinism in the wrong place, so I never do it.) Each run is scored on two numbers: Access-block hours: total time patients spend waiting for a bed. Lower is better. End-of-day headroom: clean, empty beds left at day's end. Higher is better. 
The results, across a normal weekday and a flu surge: Normal weekday: access-block hours (lower is better) 14.5 → 9.0 (−38%), beds free at end (higher is better) 1 → 2. Flu surge: access-block hours 84.5 → 44.5 (−47%), beds free at end 1 → 2. The agent helps on both numbers, on both days. On the flu-surge day it clears about 40 bed-hours of waiting — roughly four patients spared a night stuck in the emergency department. The stack Layer Choice Why Agent runtime OpenCode (opencode serve, headless) Built-in ReAct loop, helper agents, permission config, model-swap, and session audit. Everything an agent needs, off the shelf. App + backend Next.js (App Router) + the SDK Drives one prompt per tick, and keeps the clock and the world mine. Environment In-process simulator Typed events on a seedable clock. The only stateful, hospital-aware component. Forecasts Transparent heuristic behind a tool You can always see why it predicted what it did, so the AI stays the star rather than a black-box forecaster. Tools TypeScript tool() + Zod Type-safe args, native to OpenCode's permission system. Models OpenCode Zen free tier by default, swappable to hosted Claude or local Ollama Zero-key, zero-cost demo, with portability and no code change. Auth Better Auth + SQLite Self-hosted accounts, server-side sessions, invite-gated sign-up, and viewer/coordinator/superadmin roles. Tests Vitest + Playwright 250+ fast, deterministic tests. The safety and approval guarantees are tested, not assumed. It ships self-hosted: opencode serve plus the Next.js server, with the simulator in-process. Ingress is a Cloudflare Tunnel, so the app port is never published and the origin isn't directly routable. The default model is the OpenCode Zen free tier: no API key, no local server, no cost. Clone it and it runs. What I learned The hospital was never really the point. The shape is. 
An agent that watches a changing world, reasons over it with tools, proposes ranked actions, and stops for a human before anything irreversible is a pattern that travels well. Swap the simulator for a warehouse and you have a stock-rebalancing assistant. Swap the discharge forecast for project deadlines and you have a planning co-pilot. Swap the approval card for a Slack message and you have a human-in-the-loop assistant for any job where an AI suggests and a person decides. The four decisions above are the pattern, and they're coming with me to the next project: rent the harness, blindfold the model with tools, own the gate, seed the world. Where it goes next A few directions I'd take it further: Richer scenarios. More than a weekday and a flu surge (weekend staffing dips, mass-casualty spikes, seasonal patterns) to stress the agent against situations it hasn't seen. Smarter forecasting. The discharge and demand forecasts are deliberately simple, transparent heuristics. Swapping in a learned model (still behind the same tool boundary) would test how far the agent's reasoning holds when the inputs get noisier. Learning from approvals. Right now every tick starts fresh. Feeding back which suggestions coordinators accept or reject would let the ranking adapt to how a given ward actually works. A real integration boundary. The simulator speaks a small HTTP surface on purpose. Pointing the same tools at a sandboxed, fully synthetic copy of a real bed-management system (never live patient data) would test the design against messier inputs. Hospital-wide view. Two wards is enough to prove the loop; coordinating across an entire site, where freeing a bed in one place creates pressure in another, is the harder and more interesting version of the problem. Diagrams The four diagrams that map the system, collected in one place. 
Component architecture The tick cycle How a blocked patient gets unstuck The bed lifecycle Source Full source: github.com/devdaviddr/ai-patient-flow-orchestrator ."
  },
  {
    "slug": "2026-06-25-bff-architecture-fullstack-app-azure",
    "title": "BFF Architecture on Azure Container Apps",
    "description": "A walkthrough of the Backend-for-Frontend pattern on Azure — single Container App Environment, one external ingress, internal-only backend services, and server-side token handling with HttpOnly session cookies.",
    "tags": [
      "azure",
      "bff",
      "architecture",
      "fullstack"
    ],
    "excerpt": "TL;DR The BFF pattern puts a thin API layer between your frontend and your backend microservices. Your React SPA talks to one endpoint. The BFF handles auth, aggregation, and data shaping — so your backend services stay general-purpose and your front",
    "content": "TL;DR The BFF pattern puts a thin API layer between your frontend and your backend microservices. Your React SPA talks to one endpoint. The BFF handles auth, aggregation, and data shaping — so your backend services stay general-purpose and your frontend never fetches data it doesn't need. This article walks through one on Azure Container Apps: a single Container App Environment, one external ingress on the BFF, three internal-only backend services, and server-side token handling via Entra ID. The Backend-for-Frontend (BFF) pattern solves a specific problem: your backend APIs should not be shaped by your frontend's rendering needs. When your web client, mobile app, and third-party integrations all talk to the same backend endpoints, one of two things happens — either every response grows to include data nobody asked for, or each client has to stitch together several calls to render a single screen. A BFF is a per-client middleware layer. It lives between your frontend and your backend services and handles the transformations specific to that client. Your backend services stay pure — they expose domain operations. Your frontend stays fast — it makes one call per view. What problem does BFF solve? Without a BFF, every frontend feature that needs data from multiple services triggers multiple round trips from the browser: %%{init: {'theme': 'dark'}}%% flowchart LR subgraph without[No BFF — Multiple Round Trips] browser([&quot;Browser (React SPA)&quot;]) api1[&quot;/api/products&quot;] api2[&quot;/api/orders&quot;] api3[&quot;/api/users&quot;] browser --&gt; api1 browser --&gt; api2 browser --&gt; api3 end classDef problem fill:#fee2e2,stroke:#dc2626,color:#15171a classDef normal fill:#dbeafe,stroke:#2563eb,color:#15171a class api1,api2,api3 problem class browser normal This creates three compounding problems: Latency. The browser pays the round-trip cost separately for each service. On mobile networks, three serial calls can add seconds to a page load. 
Over-fetching. Each backend service returns its full domain model. The frontend only needs a handful of fields from each — but it has to download everything anyway. Distributed auth. Every backend service must independently validate the user's token. If your auth scheme changes, every service needs a coordinated update. With a BFF in between: %%{init: {'theme': 'dark'}}%% flowchart LR subgraph with[BFF — Single Round Trip] browser2[&quot;Browser (React SPA)&quot;] bff[&quot;BFF&lt;br/&gt;(Express on Container App)&quot;] apib1[&quot;Products API&quot;] apib2[&quot;Orders API&quot;] apib3[&quot;Users API&quot;] browser2 --&gt;|one call| bff bff --&gt; apib1 bff --&gt; apib2 bff --&gt; apib3 end classDef bffClass fill:#fef3c7,stroke:#d97706,color:#15171a classDef normal fill:#dbeafe,stroke:#2563eb,color:#15171a class bff bffClass class browser2,apib1,apib2,apib3 normal The browser makes one call. The BFF orchestrates the backend calls server-side — fast, within the Azure data centre — aggregates the responses, trims the payload to exactly what the component needs, and sends it back. Auth is validated once, at the BFF boundary. Architecture overview Everything runs inside a single Container App Environment (CAE). This is Azure's managed hosting boundary for groups of container apps that share a virtual network, internal DNS, and observability infrastructure. 
%%{init: {'theme': 'dark'}}%% flowchart TB subgraph cae[&quot;Container App Environment&quot;] direction TB spa[&quot;React SPA&lt;br/&gt;Container App&quot;] bff[&quot;BFF — Express + Node.js&lt;br/&gt;Container App&lt;br/&gt;🛡️ External ingress&quot;] subgraph internal[&quot;Internal-only backend services&quot;] products[&quot;Products&lt;br/&gt;Container App&quot;] orders[&quot;Orders&lt;br/&gt;Container App&quot;] users[&quot;Users&lt;br/&gt;Container App&quot;] end spa --&gt;|&quot;HttpOnly session cookie&lt;br/&gt;(no tokens in browser)&quot;| bff bff --&gt;|internal CAE DNS| products bff --&gt;|internal CAE DNS| orders bff --&gt;|internal CAE DNS| users end entraid((&quot;Entra ID&lt;br/&gt;Identity Provider&quot;)) -.-&gt;|OIDC code exchange| bff classDef caeClass fill:#f0f7ff,stroke:#2563eb,color:#15171a classDef bffClass fill:#fef3c7,stroke:#d97706,color:#15171a classDef internalClass fill:#fef9e7,stroke:#f59e0b,color:#15171a classDef idpClass fill:#eef2f7,stroke:#6b7280,color:#15171a class cae,internal caeClass class bff bffClass class products,orders,users internalClass class entraid idpClass The key design decisions: Only the BFF has external ingress. The three backend services are configured with internal ingress only — they have no public endpoint and cannot be reached from outside the CAE. The browser can never call them directly. The React SPA is also a container app. It sits inside the same environment and communicates with the BFF via an HttpOnly session cookie — no tokens ever reach the browser. Backend services reach each other over internal CAE DNS. No API Management layer, no load balancer between the BFF and its downstream services — just private hostnames within the shared virtual network. All container apps inside a single CAE, one external ingress on the BFF, internal-only backend services, and token isolation at the SPA↔BFF boundary. 
The auth pattern — tokens stay server-side The most important thing the BFF does for security is keep tokens off the browser entirely. Here is how that works: Login (OIDC code exchange). When a user signs in, the BFF performs the full OAuth 2.0 authorization code flow with Entra ID server-side. Entra ID issues an access token and a refresh token — but both are stored in the BFF's server-side session store (backed by something like Redis in production), never sent to the browser. Session cookie. The BFF issues the browser a single HttpOnly session cookie. This cookie contains only an opaque session identifier — a random string that maps to the server-side session. Because it is HttpOnly, JavaScript running in the browser cannot read it. Because it is not a token, there is nothing an attacker can decode if it leaks. Per-request validation. On every subsequent request, the browser sends the session cookie. The BFF looks up the session ID, retrieves the stored access token, validates it, and — if the token has expired — silently refreshes it using the stored refresh token. The browser never participates in the refresh flow and never sees the new token. Backend services. When the BFF calls a downstream service, it can attach the access token as a bearer header on the internal request. The backend service validates the token normally. The crucial point: the token travels only over the private CAE network, never over the public internet and never through the browser. This pattern is sometimes called the Token Handler pattern. The BFF acts as a secure token proxy — the client thinks in sessions, the backend thinks in tokens, and the BFF bridges the two. How the BFF aggregates and shapes data The BFF's other job is eliminating the N+1 round-trip problem. Consider a dashboard that needs the user's profile, their recent orders, and a product list to resolve order line items. Without a BFF, the browser makes three separate requests and joins the data client-side. 
With a BFF: The browser makes one request to the BFF's dashboard endpoint, sending only the session cookie. The BFF fans out all three backend calls in parallel — it does not wait for Products to respond before calling Orders. Total network time is determined by the slowest call, not the sum of all three. The BFF joins the results, matching order line items to product names, entirely server-side. The backend calls that feed the join travel over the private CAE network, which is far faster than a browser making the same calls over the public internet. The BFF returns a single shaped payload containing exactly the fields the dashboard component renders: a greeting, an order count, a short list of recent orders with resolved product names, and a cart item count. Nothing more. The frontend component receives a pre-joined view model. It does not filter, it does not join, it does not page through a generic list. This is what keeps frontend components simple and avoids the proliferation of client-side data-transformation logic. Request flow %%{init: {'theme': 'dark'}}%% sequenceDiagram participant Browser participant BFF as BFF (Container App) participant Products participant Orders participant Users participant Entra as Entra ID Browser-&gt;&gt;BFF: GET /api/bff/dashboard (HttpOnly cookie) Note over BFF: Validate session opt Token expired BFF-&gt;&gt;Entra: POST /token (refresh grant) Entra--&gt;&gt;BFF: new access token end par Fetch backend data (internal CAE DNS) BFF-&gt;&gt;Products: GET http://products/api/products Products--&gt;&gt;BFF: product list BFF-&gt;&gt;Orders: GET http://orders/api/orders?userId=x Orders--&gt;&gt;BFF: user orders BFF-&gt;&gt;Users: GET http://users/api/users/x/profile Users--&gt;&gt;BFF: profile end Note over BFF: Aggregate + shape BFF--&gt;&gt;Browser: {greeting, orderCount, recentOrders, cartItemCount} Notice that the BFF only contacts Entra ID inside the opt block — and only to refresh an expired access token, not on every request. 
The initial OIDC code exchange happened once, at login. The rest of the time the BFF validates the session locally and proceeds straight to the backend calls. Internal networking — no gateway needed A Container App Environment gives every container app inside it a private DNS name matching its app name. The BFF can reach the Products service simply by calling http://products — Azure resolves that to the correct internal endpoint automatically. This has a few useful consequences: No API Management layer required between the BFF and its backend services. Within the CAE, service-to-service calls are private, cheap, and fast. No service discovery configuration. App names are hostnames. If you rename a service, update the BFF's reference to that hostname — nothing else changes. Backend services are not addressable from outside the CAE. Their ingress is set to internal-only. There is no public IP, no DNS record outside the environment, and no way for an external client to reach them even if they tried. The only surface exposed to the internet is the BFF's external ingress endpoint. Azure Container Apps managed ingress handles TLS termination and custom domain binding for that endpoint. There is no WAF by default — if your compliance requirements demand one (HIPAA, SOC 2, PCI), you would place Azure Front Door or Application Gateway in front of the BFF. Caching at the BFF layer Because the BFF aggregates calls that the browser used to make individually, it is also in the ideal position to cache them. Data that changes infrequently — product catalogues, user profiles — can be cached in the BFF's memory (or in a Redis cache for multi-instance deployments) with a short TTL, so repeated frontend requests do not re-hit the backend services on every page load. The caching strategy should be selective: Static-ish data (product lists, reference data): a few minutes TTL is safe. 
User-specific data (profile, recent orders): either a very short TTL or no caching — and always scope the cache key to the user's identity, not just the URL, to avoid serving one user's data to another. Personalised aggregate views (the dashboard itself): do not cache the assembled view model, since it is composed of user-specific data. Cache the individual downstream calls that feed into it instead. Deploying on Azure Container Apps The infrastructure is straightforward in Bicep. You define one managed environment resource — this is the CAE that hosts everything. Then you define one container app per service, all pointing at the same environment. The critical difference between the BFF and the backend services is the ingress configuration: BFF: ingress.external = true, which gives it a public HTTPS endpoint for the browser. That endpoint also hosts the redirect URI that Entra ID sends the browser back to after sign-in. Products, Orders, Users: ingress.external = false, which makes them reachable only within the CAE over internal DNS. The BFF container app also receives the Entra tenant ID and client ID as environment variables. These are used to construct the OIDC endpoints and to validate tokens — they are not secrets (the client secret is), so passing them as plain env vars in the Bicep template is fine. The client secret itself should be stored in Azure Key Vault and injected via a Key Vault reference, not hardcoded. Deploying is a single az deployment group create command pointing at the Bicep template, passing the app name, Entra tenant ID, and Entra client ID as parameters. When BFF is the right call Scenario BFF helps? 
Why Web + mobile sharing backend APIs Yes Each client gets its own BFF, shaped for its rendering model Legacy monolith with a new SPA frontend Yes The BFF acts as an adapter — you don't refactor the monolith Simple CRUD app, one frontend No The extra layer adds latency and complexity for no benefit Public API with third-party consumers No Those consumers want the full domain model, not a view model Microservices with per-screen orchestration Yes Prevents N+1 round trips from the browser to N services Rule of thumb: if your frontend makes three or more API calls to render one screen, or you have multiple distinct client types, a BFF pays for itself in reduced complexity. If you have one frontend and one backend, skip it. Common pitfalls BFF becomes a monolith. Each route handler in the BFF should be thin — orchestration only, no business logic. Business rules stay in the backend services. If the BFF is making decisions about pricing or inventory, something has gone wrong with the boundary. Shared BFF for web and mobile. The whole point is per-client specialisation. A shared BFF defeats the purpose — a mobile home screen needs a different data shape than a web dashboard. Run separate BFF container apps, one per client type. No timeouts on backend calls. If one backend service hangs, it will hold the BFF handler open for the duration of the default HTTP timeout. Every downstream call from the BFF should have an explicit timeout. Partial failure handling — returning a degraded response when one service is slow — is also worth building in early. Accidentally exposing a backend service. It is easy to deploy a new container app and forget to set its ingress to internal-only. If a backend service gets an external ingress, the entire token-isolation design is bypassed — the browser could call it directly. A deployment policy or CI gate that audits ingress configuration is worth adding before you go to production. No WAF by default. 
Azure Container Apps managed ingress provides TLS termination and custom domains, but not an application-layer firewall. If your workload requires WAF protection, place Azure Front Door or Application Gateway in front of the BFF and configure WAF rules there. Over-fetching still happens inside the BFF. Backend calls within the CAE may return large objects when the BFF only needs a few fields. This is usually fine — the call is fast and internal — but worth keeping an eye on if payload sizes grow. The backend service is the right place to add field projection if it becomes a problem. Further reading Backends for Frontends pattern — the canonical write-up in the Azure Architecture Center. Azure Container Apps overview — what a Container App Environment is and how apps run inside it. Ingress in Azure Container Apps — external vs. internal ingress, the flag that keeps backend services private. Microsoft identity platform — OAuth 2.0 authorization code flow — the Entra ID flow the BFF runs server-side at login. The Token Handler pattern — the security pattern behind keeping tokens server-side and handing the browser only a cookie."
  },
  {
    "slug": "2026-05-24-building-an-ai-agent-from-scratch-ollama-python",
    "title": "Tutorial: Build an AI Agent from Scratch with Ollama and Python",
    "description": "A from-scratch tutorial that builds a local personal-planner agent in plain Python, backed by Ollama and SQLite. No frameworks. Each section breaks the previous version to motivate the next fundamental: tools, short-term memory, long-term memory, planning, and reflection.",
    "tags": [
      "ai",
      "ollama",
      "python",
      "agents",
      "tutorial",
      "from-scratch"
    ],
    "excerpt": "TL;DR ~500 lines of plain Python, no frameworks, local Ollama model, one SQLite file. Each section breaks the previous version to introduce the next fundamental: tools → short-term memory → long-term memory → planning → reflection . By the end you ha",
    "content": "TL;DR ~500 lines of plain Python, no frameworks, local Ollama model, one SQLite file. Each section breaks the previous version to introduce the next fundamental: tools → short-term memory → long-term memory → planning → reflection . By the end you have a working personal planner and a clear mental model of what agent frameworks do for you. Code: github.com/devdaviddr/personal-planner-agent This tutorial builds a small AI agent from scratch in plain Python. It runs against a local Ollama model, stores everything in SQLite, and uses no frameworks. We start with a single LLM call and layer on the four patterns that make it an agent: tools, memory, planning, and reflection. By the end you will have: A local personal-planner agent you talk to from the terminal. It can add, list, complete, update, and delete tasks; it remembers facts about you across sessions; it plans before acting and reflects on what it did. A working mental model of what an agent framework actually does for you, so you can decide when to reach for one and when not to. Roughly 500 lines of Python, all stdlib plus the ollama client. All the code lives at github.com/devdaviddr/personal-planner-agent if you want to clone-and-run before reading. A taste of what the finished agent looks like in use — note that it remembers a preference you told it weeks earlier and uses it to answer a question that has nothing to do with the original turn: you&gt; i have wednesday afternoons free for meetings bot&gt; Noted. # ...some time later, new session... you&gt; when should i schedule the dentist? bot&gt; Plan: - recall any free-time preferences - resolve &quot;next Wednesday&quot; via get_today - propose a date [recall(&quot;when is the user free for appointments&quot;) → &quot;Wednesday afternoons&quot;] [get_today() → 2026-05-24] You mentioned Wednesday afternoons are free. Next Wednesday is 2026-05-27 — want me to add it? What is an agent, really? 
Strip the term down and an agent is four things in a loop: An LLM that picks the next action. A set of tools the LLM can call (functions, basically). Memory so it carries state across turns and sessions. A control loop that keeps calling the LLM until it's done. Everything else — planning, reflection, multi-agent orchestration, retrieval — is a refinement of one of those four. The shape of the loop itself (think → act → observe → think again) is often called ReAct , and it's the load-bearing structure of every agent system, from one-file scripts to multi-agent orchestration platforms. The rest of this article introduces each refinement only after showing what visibly breaks without it. What you will build A single Python program. It opens a terminal REPL, persists everything to a local planner.db SQLite file, and talks to Ollama for both chat completions and embeddings. flowchart LR user([&quot;you&lt;br/&gt;(terminal REPL)&quot;]) agent[&quot;agent loop&lt;br/&gt;(Python)&quot;] ollama[(&quot;Ollama&lt;br/&gt;qwen3.5:9b / 4b&lt;br/&gt;nomic-embed-text&quot;)] db[(&quot;SQLite&lt;br/&gt;planner.db&quot;)] user &lt;--&gt; agent agent &lt;--&gt;|chat + embeddings| ollama agent &lt;--&gt;|tasks · messages · memories| db classDef external fill:#eef2f7,stroke:#6b7280,color:#15171a classDef internal fill:#dbeafe,stroke:#2563eb,color:#15171a class user,ollama external class agent,db internal SQLite holds three tables: tasks — the planner's domain data (title, due date, status). messages — conversation history, one row per turn, keyed by session. memories — long-term facts about you, stored with an embedding for retrieval. That's the whole system. We will build it up one layer at a time. Prerequisites Requirement Notes Python 3.11+ Used for `str Ollama Install from https://ollama.com . CPU works but is slow; a GPU with 8+ GB VRAM makes the experience interactive. The ollama Python client pip install 'ollama&gt;=0.4' . The only third-party dependency. 
Older clients expose embeddings() instead of embed() and will fail on the code below. A terminal We will run a REPL. Any shell. Pull the three models we will use: ollama pull qwen3.5:9b ollama pull qwen3.5:4b ollama pull nomic-embed-text qwen3.5:9b is the main agent model — reliable for tool calling, planning, and generation. qwen3.5:4b is a smaller, faster model reserved for the reflection critic pass; it only needs to judge yes/no, so a smaller model is sufficient. If you have less VRAM, llama3.2:3b works for the main model but trips on tool schemas more often. nomic-embed-text is a 768-dim embedding model used for long-term memory retrieval. The 30-line naive agent Start with the smallest possible thing that calls an LLM: # v1_naive.py import ollama MODEL = \"qwen3.5:9b\" def chat (prompt: str ) -> str : res = ollama.chat( model = MODEL , messages = [{ \"role\" : \"user\" , \"content\" : prompt}], ) return res[ \"message\" ][ \"content\" ] if __name__ == \"__main__\" : while (user := input ( \"you> \" ).strip()): print ( f \"bot> { chat(user) }\\n \" ) Run it: you&gt; what's the capital of France? bot&gt; Paris. you&gt; add a task to buy milk tomorrow bot&gt; Sure! I've added &quot;buy milk&quot; to your task list for tomorrow. The second reply is a lie. There is no task list. There is no tomorrow — the model has no way to do anything beyond producing text. It also forgets the previous turn the moment the next one starts, because we send a fresh single-message history every time. This is the baseline. 
Everything from here on is fixing a specific failure of this version. Tools — letting the agent do things A tool is a Python function the LLM can decide to call. The agent loop is responsible for advertising those functions to the model (as JSON Schema), watching for tool_calls in the response, executing them, and feeding the results back. The loop never decides what to do — the model does. The loop is purely mechanical: it dispatches whatever the model asks for, feeds the result back, and asks again. The model stops by producing a turn with no tool calls. This is the single most important idea in this section; everything below is plumbing for it. Define a small registry: # tools.py import json, sqlite3 from datetime import date TOOLS : dict[ str , dict ] = {} def tool (name: str , description: str , schema: dict ): def decorator (fn): TOOLS [name] = { \"description\" : description, \"schema\" : schema, \"fn\" : fn} return fn return decorator def tool_specs () -> list[ dict ]: return [ { \"type\" : \"function\" , \"function\" : { \"name\" : name, \"description\" : t[ \"description\" ], \"parameters\" : t[ \"schema\" ], }, } for name, t in TOOLS .items() ] This section adds five task tools and a get_today . Two more tools — remember and recall for long-term memory — come in the long-term memory section. 
We also define READ_ONLY_TOOLS here since it lives in tools.py ; it's a frozenset marker, not a callable tool — the reflection loop uses it to skip the critic pass when only non-mutating tools were called in a turn: READ_ONLY_TOOLS : frozenset[ str ] = frozenset ({ \"list_tasks\" , \"get_today\" , \"recall\" , }) The check_same_thread=False on the connection is required because the concurrent tool dispatch introduced in the reflection section uses a ThreadPoolExecutor, and Python's sqlite3 module otherwise refuses to use a connection from any thread other than the one that created it. Sharing the connection is safe here because SQLite serialises access internally when the library is built in serialized threading mode (the usual default), and there is a single writer process. db = sqlite3.connect( \"planner.db\" , check_same_thread = False ) db.row_factory = sqlite3.Row db.executescript( \"\"\" CREATE TABLE IF NOT EXISTS tasks ( id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL, due_date TEXT, status TEXT NOT NULL DEFAULT 'open', created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); \"\"\" ) @tool ( \"add_task\" , \"Add a task to the planner.\" , { \"type\" : \"object\" , \"properties\" : { \"title\" : { \"type\" : \"string\" }, \"due_date\" : { \"type\" : \"string\" , \"description\" : \"ISO 8601 date, e.g. 2026-05-28\" }, }, \"required\" : [ \"title\" ], }) def add_task (title: str , due_date: str | None = None ) -> dict : if due_date is not None : date.fromisoformat(due_date) # reject hallucinated dates loudly cur = db.execute( \"INSERT INTO tasks (title, due_date) VALUES (?, ?)\" , (title, due_date), ) db.commit() return { \"id\" : cur.lastrowid, \"title\" : title, \"due_date\" : due_date} @tool ( \"list_tasks\" , \"List tasks. 
Defaults to open; pass status='done' or status='all' for others.\" , { \"type\" : \"object\" , \"properties\" : { \"status\" : { \"type\" : \"string\" , \"enum\" : [ \"open\" , \"done\" , \"all\" ], \"default\" : \"open\" }, }, }) def list_tasks (status: str = \"open\" ) -> list[ dict ]: if status == \"all\" : rows = db.execute( \"SELECT id, title, due_date, status FROM tasks \" \"ORDER BY status, due_date IS NULL, due_date\" , ).fetchall() else : rows = db.execute( \"SELECT id, title, due_date, status FROM tasks WHERE status = ? \" \"ORDER BY due_date IS NULL, due_date\" , (status,), ).fetchall() return [ dict (r) for r in rows] @tool ( \"complete_task\" , \"Mark a task complete.\" , { \"type\" : \"object\" , \"properties\" : { \"id\" : { \"type\" : \"integer\" }}, \"required\" : [ \"id\" ], }) def complete_task (id: int ) -> dict : cur = db.execute( \"UPDATE tasks SET status = 'done' WHERE id = ?\" , ( id ,)) db.commit() if cur.rowcount == 0 : return { \"error\" : f \"no task with id= {id} \" } return { \"id\" : id , \"status\" : \"done\" } @tool ( \"update_task\" , \"Rename a task or change its due date. 
Pass only the fields you want changed.\" , { \"type\" : \"object\" , \"properties\" : { \"id\" : { \"type\" : \"integer\" }, \"title\" : { \"type\" : \"string\" }, \"due_date\" : { \"type\" : \"string\" , \"description\" : \"ISO 8601 date or empty string to clear\" }, }, \"required\" : [ \"id\" ], }) def update_task (id: int , title: str | None = None , due_date: str | None = None ) -> dict : sets, params = [], [] if title is not None : sets.append( \"title = ?\" ) params.append(title) if due_date is not None : if due_date != \"\" : date.fromisoformat(due_date) sets.append( \"due_date = ?\" ) params.append(due_date or None ) if not sets: return { \"error\" : \"nothing to update\" } params.append( id ) cur = db.execute( f \"UPDATE tasks SET { ', ' .join(sets) } WHERE id = ?\" , params) db.commit() if cur.rowcount == 0 : return { \"error\" : f \"no task with id= {id} \" } return { \"id\" : id , \"updated\" : True } @tool ( \"delete_task\" , \"Delete a task permanently.\" , { \"type\" : \"object\" , \"properties\" : { \"id\" : { \"type\" : \"integer\" }}, \"required\" : [ \"id\" ], }) def delete_task (id: int ) -> dict : cur = db.execute( \"DELETE FROM tasks WHERE id = ?\" , ( id ,)) db.commit() if cur.rowcount == 0 : return { \"error\" : f \"no task with id= {id} \" } return { \"id\" : id , \"deleted\" : True } @tool ( \"get_today\" , \"Get today's date in ISO 8601.\" , { \"type\" : \"object\" , \"properties\" : {}}) def get_today () -> dict : return { \"date\" : date.today().isoformat()} 
Now the agent loop: # agent.py import json, ollama from tools import TOOLS , tool_specs MODEL = \"qwen3.5:9b\" MAX_TURNS = 4 def to_dict (msg) -> dict : # Ollama returns Pydantic models. Convert to plain dicts so json.dumps # (and later, SQLite storage) work without a custom encoder. 
return msg.model_dump( exclude_none = True ) if hasattr (msg, \"model_dump\" ) else msg def run (user_input: str ) -> str : messages = [{ \"role\" : \"user\" , \"content\" : user_input}] for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs()) msg = to_dict(res[ \"message\" ]) messages.append(msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ) for call in calls: name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: result = { \"error\" : str (e)} messages.append({ \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name, }) return \"I hit my tool-call limit.\" 
The shape of one turn: flowchart TD start([user message]) chat[&quot;ollama.chat&lt;br/&gt;(model + tools advertised)&quot;] decide{tool_calls&lt;br/&gt;in response?} finalize([return reply]) dispatch[execute tool fn] push[append result as&lt;br/&gt;role: tool message] start --&gt; chat chat --&gt; decide decide --&gt;|no| finalize decide --&gt;|yes| dispatch dispatch --&gt; push push --&gt; chat classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class chat,dispatch,push step class start,finalize terminal class decide step In code, &quot;the model decides&quot; looks like the conditional on calls : if the model returns no tool_calls , the loop returns its content . Otherwise it dispatches every call the model asked for, in order, and feeds each result back as a role: &quot;tool&quot; message before going around again. A turn now looks like: you&gt; add a task to buy milk by friday bot&gt; Done — added &quot;buy milk&quot; with due date 2026-05-29. The lie is gone. Real row in tasks , real due_date . What's still broken: start a new run of the program. 
Ask &quot;what tasks do I have?&quot; The model has to call list_tasks from a cold start every time because we discard messages at the end of run . Worse, even within a single REPL session, the next user message gets none of the previous turn's context. The agent has no memory. Short-term memory — conversation state Persist messages to SQLite, keyed by session id. Load them on each turn. Trim to a budget. CREATE TABLE IF NOT EXISTS messages ( id INTEGER PRIMARY KEY AUTOINCREMENT, session TEXT NOT NULL, role TEXT NOT NULL, content TEXT, tool_calls TEXT, tool_name TEXT, created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); # memory.py import json, sqlite3 HISTORY_LIMIT = 20 # rough turn budget def save_message (db: sqlite3.Connection, session: str , msg: dict ) -> None : db.execute( \"INSERT INTO messages (session, role, content, tool_calls, tool_name) \" \"VALUES (?, ?, ?, ?, ?)\" , ( session, msg[ \"role\" ], msg.get( \"content\" ), json.dumps(msg[ \"tool_calls\" ]) if msg.get( \"tool_calls\" ) else None , msg.get( \"tool_name\" ), ), ) db.commit() def load_history (db: sqlite3.Connection, session: str ) -> list[ dict ]: rows = db.execute( \"SELECT role, content, tool_calls, tool_name FROM messages \" \"WHERE session = ? ORDER BY id DESC LIMIT ?\" , (session, HISTORY_LIMIT ), ).fetchall() msgs = [] for r in reversed (rows): m = { \"role\" : r[ \"role\" ]} if r[ \"content\" ] is not None : m[ \"content\" ] = r[ \"content\" ] if r[ \"tool_calls\" ]: m[ \"tool_calls\" ] = json.loads(r[ \"tool_calls\" ]) if r[ \"tool_name\" ]: m[ \"tool_name\" ] = r[ \"tool_name\" ] msgs.append(m) return trim_to_user_boundary(msgs) def trim_to_user_boundary (msgs: list[ dict ]) -> list[ dict ]: # Tool-calling APIs require an assistant message with tool_calls to be # immediately followed by role: tool messages for each call. 
After a # window slice we have to guard both ends: # (1) start on a user message — drop any orphan tool/assistant prefix # (2) drop trailing orphans: a role:tool with no live assistant before # it, or an assistant whose tool_calls were never answered. start = next ((i for i, m in enumerate (msgs) if m[ \"role\" ] == \"user\" ), None ) if start is None : return [] msgs = msgs[start:] while msgs and ( msgs[ - 1 ].get( \"role\" ) == \"tool\" or (msgs[ - 1 ].get( \"role\" ) == \"assistant\" and msgs[ - 1 ].get( \"tool_calls\" )) ): msgs.pop() return msgs 
The trim function is the one piece of subtlety. Naïve sliding-window truncation will sometimes cut between an assistant message that contains tool_calls and the role: tool messages that satisfy those calls. A strict tool-calling API will reject that history with a confusing error. The fix is to guard both ends: skip any leading orphan tool or assistant fragments until we land on a user message, and drop any trailing assistant whose tool_calls were never answered. The loop now uses the DB as the single source of truth — save the user message first, then load history, then run the turn: # agent.py from tools import TOOLS , tool_specs, db from memory import save_message, load_history SYSTEM = { \"role\" : \"system\" , \"content\" : ( \"You are a personal planner. The user manages tasks through you. \" \"Use get_today before reasoning about relative dates like 'tomorrow' or 'next Tuesday'. 
\" \"Keep replies short.\" ), } def run (session: str , user_input: str ) -> str : save_message(db, session, { \"role\" : \"user\" , \"content\" : user_input}) messages = [ SYSTEM , * load_history(db, session)] for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs()) msg = to_dict(res[ \"message\" ]) messages.append(msg) save_message(db, session, msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ) for call in calls: name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: result = { \"error\" : str (e)} tool_msg = { \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name, } messages.append(tool_msg) save_message(db, session, tool_msg) return \"I hit my tool-call limit.\" 
Restart the REPL. Ask &quot;what was the last task I added?&quot; The agent answers from history. It remembers within a session and across restarts that reuse the same session id, up to the last HISTORY_LIMIT stored messages. What's still broken: the agent remembers the conversation but it has no notion of facts about you . Tell it &quot;I have Wednesday afternoons free&quot; today and ask &quot;when should I schedule a meeting?&quot; three weeks from now and it has no idea — that turn was trimmed out of the window long ago. Some things need to outlive the conversation buffer. Long-term memory — facts that outlive the window Short-term memory is indexed by recency ; long-term memory is indexed by meaning . You need both — conversations end but facts shouldn't. The data structure for &quot;indexed by meaning&quot; is an embedding : a fixed-length list of floats produced by a dedicated embedding model (here, nomic-embed-text ), trained so that two pieces of text with similar meaning land near each other in vector space. Search becomes a similarity comparison instead of a substring match. 
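To make the difference between similarity search and substring search concrete, here is a minimal sketch. It is a toy under stated assumptions: the three-number vectors are invented by hand to stand in for real embeddings (nomic-embed-text would return 768 floats per text), and MEMORIES, QUERY_TEXT, QUERY_VEC, best_match and substring_hits are hypothetical names used only for this illustration.

```python
# Toy illustration only: hand-written 3-d vectors stand in for real embeddings.
import math


def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Hypothetical stored facts with invented vectors (assumption, not model output).
MEMORIES = {
    'user has Wednesday afternoons free': [0.9, 0.1, 0.0],
    'user is allergic to peanuts': [0.0, 0.2, 0.9],
}

QUERY_TEXT = 'when is the user available for appointments'
QUERY_VEC = [0.8, 0.3, 0.1]  # invented vector, chosen to sit near the first fact


def best_match(query_vec, memories):
    # Similarity search: pick the stored text whose vector is closest in angle.
    return max(memories, key=lambda text: cosine(query_vec, memories[text]))


def substring_hits(query_text, memories):
    # Substring search: only matches when the query appears verbatim in a fact.
    return [text for text in memories if query_text in text]


if __name__ == '__main__':
    print(best_match(QUERY_VEC, MEMORIES))
    print(substring_hits(QUERY_TEXT, MEMORIES))
```

The query shares almost no wording with the stored fact, so the substring search finds nothing, while the vector comparison still ranks the scheduling fact first. With a real embedding model the vectors come from the model instead of being typed in.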
Two new tools ( remember and recall ) give the agent control over what to save and when to retrieve. CREATE TABLE IF NOT EXISTS memories ( id INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT NOT NULL, embedding BLOB NOT NULL, created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); Embeddings come from Ollama and are packed into the BLOB column as raw little-endian float32 values. Cosine similarity takes a few lines of pure Python: slow at a million rows, fine for a personal planner. # embeddings.py import struct, ollama EMBED_MODEL = \"nomic-embed-text\" def embed (text: str ) -> list[ float ]: res = ollama.embed( model = EMBED_MODEL , input = text) return res[ \"embeddings\" ][ 0 ] def pack (vec: list[ float ]) -> bytes : return struct.pack( f\"<{len(vec)}f\" , * vec) # explicit little-endian float32 def unpack (blob: bytes ) -> list[ float ]: return list (struct.unpack( f\"<{len(blob) // 4}f\" , blob)) def cosine (a: list[ float ], b: list[ float ]) -> float : dot = sum (x * y for x, y in zip (a, b)) na = sum (x * x for x in a) ** 0.5 nb = sum (y * y for y in b) ** 0.5 return dot / (na * nb) if na and nb else 0.0 The tools. A naive recall would scan the whole table on every call, which is fine for dozens of memories but slow past a few thousand. 
A process-level lazy cache ( _load_memory_cache ) fixes this: load all rows from SQLite once on first recall, then serve every subsequent query from RAM. remember appends to the same cache so a freshly stored fact is immediately searchable without a re-scan: _memory_cache: list[tuple[ str , list[ float ]]] | None = None def _load_memory_cache () -> list[tuple[ str , list[ float ]]]: global _memory_cache if _memory_cache is None : rows = db.execute( \"SELECT text, embedding FROM memories\" ).fetchall() _memory_cache = [(r[ \"text\" ], unpack(r[ \"embedding\" ])) for r in rows] return _memory_cache @tool ( \"remember\" , \"Store a durable fact about the user.\" , { \"type\" : \"object\" , \"properties\" : { \"fact\" : { \"type\" : \"string\" }}, \"required\" : [ \"fact\" ], }) def remember (fact: str ) -> dict : vec = embed(fact) db.execute( \"INSERT INTO memories (text, embedding) VALUES (?, ?)\" , (fact, pack(vec)), ) db.commit() if _memory_cache is not None : _memory_cache.append((fact, vec)) return { \"ok\" : True , \"fact\" : fact} @tool ( \"recall\" , \"Search long-term memory by meaning.\" , { \"type\" : \"object\" , \"properties\" : { \"query\" : { \"type\" : \"string\" }, \"k\" : { \"type\" : \"integer\" , \"default\" : 3 }, }, \"required\" : [ \"query\" ], }) def recall (query: str , k: int = 3 ) -> list[ dict ]: qv = embed(query) scored = sorted ( ((cosine(qv, vec), text) for text, vec in _load_memory_cache()), reverse = True , ) return [{ \"text\" : t, \"score\" : round (s, 3 )} for s, t in scored[:k]] 
The data path: flowchart LR subgraph short[Short-term memory] msgs[(messages&lt;br/&gt;table)] trim[trim on user&lt;br/&gt;boundary] msgs --&gt; trim end subgraph long[Long-term memory] mem[(memories&lt;br/&gt;text + embedding)] emb[nomic-embed-text] cos[cosine similarity&lt;br/&gt;in Python] end turn([user turn]) --&gt; trim trim --&gt; ctx[message context&lt;br/&gt;sent to LLM] turn --&gt;|&quot;remember(fact)&quot;| emb1[nomic-embed-text] emb1 --&gt; mem turn --&gt;|&quot;recall(query)&quot;| emb emb --&gt; cos mem --&gt; cos cos --&gt; ctx classDef store fill:#fef3c7,stroke:#d97706,color:#15171a classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class msgs,mem store class trim,emb,emb1,cos,ctx step class turn terminal Two design choices worth flagging: The model controls both writes and reads. It calls remember when a fact seems worth keeping and recall when it suspects relevant context exists. The system prompt is updated to nudge it: &quot;If the user states a durable preference or fact about themselves, call remember. 
If a question would benefit from past context, call recall.&quot; The alternative — extracting memories automatically in a background pass — is cleaner architecturally but takes more code; the tool-driven version is the from-scratch lesson. Cosine similarity is computed in Python, not in SQL. SQLite has no native vector type, and that's the point — seeing the loop makes clear what a vector DB is actually doing for you. The in-process cache keeps the scan out of the hot path; at a few thousand rows this is fine. Past ~100k rows you want a real vector index. Now the planner can carry facts forward indefinitely: you&gt; i have wednesday afternoons free for meetings bot&gt; Noted. [remember(fact=&quot;user has Wednesday afternoons free for meetings&quot;)] # ...some weeks later, new session... you&gt; when should i schedule the dentist? bot&gt; You mentioned Wednesday afternoons are free — want me to add it for next Wednesday? [recall(query=&quot;when is the user free for appointments&quot;)] What's still broken: complex requests fall apart. Ask &quot;prep for my doctor visit next Tuesday and pick up a gift before then&quot; and the agent often does one thing, forgets the other, or gets the dates wrong. It is reacting one tool at a time without ever looking at the whole request first. Planning — think before acting Without planning, the model reacts to each tool result in isolation and loses the thread on multi-step requests — does one thing, forgets the other. Planning separates deciding what to do from doing it . Even when both happen inside one LLM call, forcing the model to commit to a structure before it starts dispatching changes its behavior. The smallest planning intervention that helps: a system-prompt instruction that says &quot;For requests that involve more than one step, first write a one-line plan, then execute it.&quot; SYSTEM = { \"role\" : \"system\" , \"content\" : ( \"You are a personal planner. 
\\n \" \"- For greetings or trivial chit-chat, reply directly in one short sentence. Do NOT plan, do NOT call tools. \\n \" \"- Use get_today before reasoning about relative dates (tomorrow, next week, etc). \\n \" \"- For multi-step requests only, write a short plan first (1–3 bullets), then call the tools to execute it. \\n \" \"- If the user states a durable preference or fact about themselves, call remember. \\n \" \"- If a question would benefit from past context, call recall before answering. \\n \" \"- Call tools through the structured tool-call interface only. Never write tool calls as JSON in your reply text. \\n \" \"- After executing, summarize what you did in one line.\" ), } Two additions over the previous version: the chit-chat guard (without it, the model plans and calls get_today on &quot;hello&quot;) and the explicit tool-call interface instruction (some models occasionally write JSON tool calls as plain text instead of using the structured interface). That's still no new code and no separate &quot;planner&quot; LLM call. The plan lives in the assistant message that precedes the tool calls. 
It costs a few extra tokens and visibly improves multi-step accuracy because the model commits to a structure before it starts dispatching. A turn now looks like: you&gt; prep for my doctor visit next tuesday and pick up a gift before then bot&gt; Plan: - get today's date to resolve &quot;next Tuesday&quot; - add &quot;prep for doctor visit&quot; due that date - add &quot;pick up gift&quot; due the day before [get_today() → 2026-05-24] [add_task(title=&quot;prep for doctor visit&quot;, due_date=&quot;2026-06-02&quot;)] [add_task(title=&quot;pick up gift&quot;, due_date=&quot;2026-06-01&quot;)] bot&gt; Added both — doctor prep on Jun 2, gift on Jun 1. For more complex domains you'd promote planning to a dedicated LLM call that produces structured JSON, then iterate over the steps. For a personal planner, the prompt-only version is enough — and resisting the urge to over-engineer is part of the lesson. What's still broken: the agent sometimes silently fails — calls a tool, gets an error, ignores it, and tells you everything went fine. It needs to check its own work. Reflection — checking its own work Reflection adds a second pair of eyes — a fresh LLM call with no investment in the previous answer, which makes it willing to say &quot;that's wrong.&quot; The acting model has an implicit bias toward declaring success because it just spent tokens on the attempt; a separate critic call doesn't. Concretely: a second ollama.chat invocation with a different system prompt and the prior transcript as input. Mechanically: after the main loop finishes, a second LLM pass looks at the transcript and decides did we actually accomplish what the user asked? If not, the critique is fed back in as a new user message and the loop runs again, up to a small retry budget. The previous run becomes _act — same body, new name — and a new run wraps it with the reflection loop. Logging. Before anything else, wire up basic logging. 
Tool failures, reflection errors, and agent activity should leave a trace: import logging from concurrent.futures import ThreadPoolExecutor logging.basicConfig( filename = \"agent.log\" , level = logging. INFO , format = \" %(asctime)s %(levelname)s %(name)s : %(message)s \" , ) logging.getLogger( \"httpx\" ).setLevel(logging. WARNING ) # silence per-request noise log = logging.getLogger( \"agent\" ) Models and performance constants. Reflection is a yes/no judgement ( done: true/false ), not generation — a smaller, faster model handles it well. Using a separate REFLECT_MODEL roughly halves the critic's latency. KEEP_ALIVE keeps both models resident in VRAM between turns (Ollama evicts after 5 minutes by default). CHAT_OPTIONS and REFLECT_OPTIONS cap context and generation length so the agent's memory use stays bounded: MODEL = \"qwen3.5:9b\" REFLECT_MODEL = \"qwen3.5:4b\" # smaller critic — reflection is yes/no, not generation MAX_TURNS = 4 MAX_REFLECTIONS = 2 KEEP_ALIVE = \"24h\" CHAT_OPTIONS = { \"num_ctx\" : 4096 , \"num_predict\" : 512 } REFLECT_OPTIONS = { \"num_ctx\" : 2048 , \"num_predict\" : 128 } _tool_pool = ThreadPoolExecutor( max_workers = 4 ) REFLECT_PROMPT = ( \"You are reviewing an agent transcript. Given the user's original request \" \"and the actions taken, answer in JSON: \" '{\"done\": true|false, \"critique\": \"...\"}. ' \"Set done=true if the request was fully satisfied. 
\" \"Set done=false and provide a concrete critique if anything is missing or wrong.\" ) Concurrent tool dispatch. The model can return multiple tool_calls in one response. Dispatching them sequentially wastes time when they're independent — get_today and recall have no ordering constraint between them. _dispatch isolates each call so the thread pool can run them concurrently; _tool_pool.map preserves insertion order so each tool message lines up with its originating call: def _dispatch (call: dict ) -> dict : name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} if isinstance (args, str ): # some models return arguments as JSON text args = json.loads(args) try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: log.exception( \"tool %s failed with args= %r \" , name, args) result = { \"error\" : str (e)} return { \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name} def _act (messages: list[ dict ]) -> tuple[ str , list[ dict ]]: \"\"\"One pass of the tool-calling loop. 
Mutates `messages` in-place and returns the slice of new turns this call added.\"\"\" start = len (messages) for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs(), options = CHAT_OPTIONS , keep_alive = KEEP_ALIVE , ) msg = to_dict(res[ \"message\" ]) messages.append(msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ), messages[start:] for tool_msg in _tool_pool.map(_dispatch, calls): messages.append(tool_msg) return \"I hit my tool-call limit.\" , messages[start:] 
def _reflect (original: str , reply: str , messages: list[ dict ]) -> dict : transcript = \" \\n \" .join( f \" { m[ 'role' ] } : { m.get( 'content' , '' ) or m.get( 'tool_calls' , '' ) } \" for m in messages[ - 8 :] ) res = ollama.chat( model = REFLECT_MODEL , messages = [ { \"role\" : \"system\" , \"content\" : REFLECT_PROMPT }, { \"role\" : \"user\" , \"content\" : f \"Original request: { original }\\n\\n Transcript: \\n{ transcript }\\n\\n Final reply: { reply } \" }, ], format = \"json\" , options = REFLECT_OPTIONS , keep_alive = KEEP_ALIVE , ) content = to_dict(res[ \"message\" ]).get( \"content\" ) or \" {} \" try : return 
json.loads(content) except (json.JSONDecodeError, TypeError ): return { \"done\" : True , \"critique\" : \"\" } # fail open on malformed output Skipping reflection for read-only turns. Reflection only earns its cost when a mutation could be wrong or incomplete. A turn that only called list_tasks , recall , or get_today has nothing to verify — the model can already see whether the result is right in the transcript. When the turn's tool calls are a subset of READ_ONLY_TOOLS , the reflection pass is skipped entirely. Persistence. An earlier version of this code persisted only the winning attempt's turns, leaving the user message dangling on failed reflection. The next turn then had two consecutive user messages, which the tool-calling API rejected. The fix: always persist the final attempt regardless of whether reflection passed — the user already saw reply , so the DB must match: def run (session: str , user_input: str ) -> str : save_message(db, session, { \"role\" : \"user\" , \"content\" : user_input}) original = user_input messages = [ SYSTEM , * load_history(db, session)] reply, added = \"\" , [] for attempt in range ( MAX_REFLECTIONS + 1 ): reply, added = _act(messages) # Skip reflection when only read-only tools were called — nothing to verify. tool_names = {m[ \"tool_name\" ] for m in added if m.get( \"role\" ) == \"tool\" } if not tool_names or tool_names &#x3C;= READ_ONLY_TOOLS : break try : verdict = _reflect(original, reply, messages) except Exception : log.exception( \"reflect failed\" ) verdict = { \"done\" : True , \"critique\" : \"\" } if verdict[ \"done\" ]: break if attempt == MAX_REFLECTIONS : break # Critique is appended in-memory only — never persisted. messages.append({ \"role\" : \"user\" , \"content\" : f \"Your previous attempt was incomplete: { verdict[ 'critique' ] } \" , }) # Always persist the final attempt — the user already saw `reply`, # so the DB must match or next turn gets two consecutive user messages. 
for m in added: save_message(db, session, m) return reply Update the import at the top of agent.py to include READ_ONLY_TOOLS : from tools import READ_ONLY_TOOLS , TOOLS , db, tool_specs from memory import save_message, load_history The full cycle: flowchart LR user([user request]) plan[&quot;1. Plan&lt;br/&gt;(in-prompt)&quot;] act[&quot;2. Act&lt;br/&gt;tool-calling loop&lt;br/&gt;(concurrent dispatch)&quot;] reflect[&quot;3. 
Reflect&lt;br/&gt;(qwen3.5:4b)&quot;] skip([reply]) done([reply]) user --&gt; plan --&gt; act --&gt; reflect reflect --&gt;|done=true| done reflect --&gt;|done=false&lt;br/&gt;critique fed back| plan act --&gt;|read-only tools only| skip classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class plan,act,reflect step class user,done,skip terminal Three things worth knowing: Reflection is expensive. It adds at least one extra LLM call per mutating turn. The READ_ONLY_TOOLS skip recovers most of that cost on lookup-heavy sessions. For a local Ollama planner where calls are free and relatively slow, always running on mutations is fine. Separate critic model saves latency. qwen3.5:4b handles the yes/no judgement in roughly half the time of the main model. KEEP_ALIVE = &quot;24h&quot; keeps both models warm in VRAM so neither reloads mid-session. Fail open on bad reflection output. If the reflector returns malformed JSON, return done=true rather than retrying forever. Trusting the critic blindly is how you get infinite loops. Putting it together All the pieces, in one file: agent.py # the loop, plan, reflect, REPL entry tools.py # tool registry, task tools (add/list/complete/update/delete), get_today, remember, recall memory.py # save/load messages, trim embeddings.py # embed, pack/unpack, cosine planner.db # SQLite, created on first run agent.log # structured log of tool calls and errors — created on first run ~500 lines of Python, no third-party dependencies except ollama . 
Clone the repo if you don't want to type the files out yourself: git clone https://github.com/devdaviddr/personal-planner-agent.git cd personal-planner-agent The REPL entry point at the bottom of agent.py : if __name__ == \"__main__\" : import uuid session = uuid.uuid4().hex # one session per process; swap for a stable id to span runs while (text := input ( \"you> \" ).strip()): print ( f \"bot> { run(session, text) }\\n \" ) Then: pip install 'ollama>=0.4' python agent.py A real transcript from the finished agent: you&gt; i have wednesday afternoons free bot&gt; Noted. you&gt; prep for my doctor visit next tuesday and pick up a gift before then bot&gt; Plan: - check today's date to resolve &quot;next Tuesday&quot; - add a prep task for that day - add a gift task for the day before Added: &quot;prep for doctor visit&quot; due 2026-06-02, &quot;pick up gift&quot; due 2026-06-01. you&gt; when should i schedule the follow-up? bot&gt; You mentioned Wednesday afternoons are free — next Wednesday is 2026-05-27. Want me to add it? you&gt; yes bot&gt; Added: &quot;doctor follow-up&quot; due 2026-05-27. you&gt; what's on my list? bot&gt; 3 open tasks: 1. pick up gift — 2026-06-01 2. prep for doctor visit — 2026-06-02 3. doctor follow-up — 2026-05-27 The remember / recall calls, the get_today resolution, the plan, and the silent reflection pass are all happening in the background. From the user's seat it just feels like the agent is thinking. 
What frameworks add (and what they take) Now that you've built one from scratch, here's what an agent framework actually gives you, so you can decide when it's worth reaching for one: Feature you wrote What a framework adds TOOLS dict + JSON-schema decorator Auto-generation from Python type hints, async dispatch, parallel tool execution. messages table + manual trim Pluggable memory backends (Redis, Postgres+pgvector, managed services), automatic summarization, token-aware truncation. recall() over SQLite A real vector DB (Chroma, LanceDB, Pinecone) with proper indexing for &gt;100k vectors. Plan-then-act in the system prompt Multi-step planners that emit structured DAGs, with per-step retries. Single-shot reflection Critic agents, self-consistency voting, debate loops. One agent, one loop Multi-agent orchestration, message passing, handoff protocols. Rule of thumb: ~10k+ memories (you need a real vector index), multiple coordinated agents (you need orchestration), async or parallel tool execution, or multi-tenant SLAs. Any single one of these is a yellow flag worth thinking about; two or more and you should reach for a framework. Below that, the from-scratch version is faster to debug and ships sooner. For a personal planner with a few thousand tasks and one user, you don't need any of it. For a customer-facing system with multiple specialized agents, ten million memories, and SLA-bound latency, you do. The point of writing the from-scratch version is that you now know exactly what you're trading away when you adopt a framework — and what you'd have to rebuild if you ever ripped one out. Troubleshooting Symptom Cause and fix ollama._types.ResponseError: model not found You haven't pulled the model. Run ollama pull qwen3.5:9b (and qwen3.5:4b , nomic-embed-text ). Model never calls tools, just answers in prose Either the model is too small or the tool descriptions are too vague. 
Try qwen3.5:9b ; if you must use a 3B model, write longer, more imperative descriptions (&quot;Use this to...&quot;). role: tool rejected by Ollama Your history was sliced mid-tool-call. Confirm trim_to_user_boundary is running. The same bug occurs if you forget to persist the assistant tool_calls message. recall returns garbage matches Your nomic-embed-text pull is incomplete or you're packing/unpacking with mismatched dtypes. Re-run ollama pull nomic-embed-text and verify embedding length is 768 ( len(embed(&quot;test&quot;)) ). Agent gets dates wrong The system prompt instruction to call get_today first is missing or the model is ignoring it. Make the instruction more emphatic, or compute today's date in Python and inject it into the system prompt every turn. Reflection loops forever MAX_REFLECTIONS is too high or your reflector is overly strict. Cap at 2 and fail open on malformed JSON output. Slow first reply Ollama loads the model into VRAM on the first request. Subsequent calls are fast. KEEP_ALIVE = &quot;24h&quot; in agent.py keeps both models warm across turns; pre-warm on startup with curl http://localhost:11434/api/generate -d '{&quot;model&quot;:&quot;qwen3.5:9b&quot;,&quot;prompt&quot;:&quot;hi&quot;}' if needed. Result A local, fully-private personal-planner agent in roughly 500 lines of Python. It runs on your laptop, persists everything to one SQLite file, and demonstrates each of the agent fundamentals — tools, short-term memory, long-term memory, planning, reflection — as a discrete, removable layer rather than as framework magic. The same skeleton generalizes. Swap the task tools for GitHub Issues, Linear, your calendar, or any REST API and you have a domain-specific agent on the same footing. Swap the SQLite memory tables for Postgres and you have something multi-user. The reflection pass shown here is closest to self-refine — within-turn critique-and-retry. 
Persist the critiques across episodes and you have the start of a true Reflexion system (Shinn et al., 2023), where the agent learns from its own past mistakes over time. The point isn't the planner. The point is that you've seen each fundamental in isolation and can now build, debug, or replace any of them without the framework that usually hides them. Source Full source for this tutorial: github.com/devdaviddr/personal-planner-agent ."
  },
  {
    "slug": "2026-05-11-local-ai-trello-bot-mcp-ollama-telegram",
    "title": "Tutorial: Build a Local-AI Trello Bot with MCP, Ollama, and Telegram",
    "description": "A step-by-step tutorial for setting up and understanding a fully-local Trello bot. The stack: a Telegram chat surface, an Ollama-hosted LLM, and an MCP server exposing 67 Trello tools. Nothing leaves your network.",
    "tags": [
      "ai",
      "ollama",
      "mcp",
      "telegram",
      "typescript",
      "self-hosting",
      "tutorial"
    ],
    "excerpt": "This tutorial walks you through setting up a Telegram bot that lets you manage your Trello boards in plain English, backed by a local Ollama instance and a 67-tool MCP server. By the end, you will have: A Telegram bot you can DM with requests like &q",
    "content": "This tutorial walks you through setting up a Telegram bot that lets you manage your Trello boards in plain English, backed by a local Ollama instance and a 67-tool MCP server. By the end, you will have: A Telegram bot you can DM with requests like &quot;what's overdue?&quot; or &quot;add a card to Roadmap called 'investigate flaky CI'&quot; . An MCP server exposing 67 Trello tools, reusable from any MCP host (Claude Desktop, the MCP Inspector, etc.). A Docker Compose deployment that runs the whole thing in a single container. A working understanding of how the pieces fit together so you can extend it for other SaaS APIs. Part 1: What you will build The system has three moving parts: Telegram , the user-facing chat surface. A bot process , which receives messages, drives an agent loop against Ollama, and dispatches tool calls. An MCP server , a subprocess of the bot that exposes Trello operations as typed tools. A separate Ollama host on your LAN runs the LLM. Trello's REST API is the only off-network dependency. flowchart LR user([&quot;Telegram user&quot;]) tg[&quot;Telegram API&quot;] bot[&quot;bot process&quot;] ollama[(&quot;Ollama&lt;br/&gt;LAN GPU box&quot;)] mcp[&quot;MCP server&quot;] trello[(&quot;Trello REST API&quot;)] user --&gt; tg tg --&gt; bot bot &lt;--&gt;|tool-calling| ollama bot -. stdio .-&gt; mcp mcp --&gt; trello classDef external fill:#eef2f7,stroke:#6b7280,color:#15171a classDef internal fill:#dbeafe,stroke:#2563eb,color:#15171a class user,tg,trello external class bot,ollama,mcp internal What is MCP? Model Context Protocol is a small standard for connecting LLMs to tools. The shape: An MCP server exposes a set of tools. Each tool has a name, a description, a JSON schema for its arguments, and a handler that performs the work. An MCP client (an LLM host like Claude Desktop, or a bot you write yourself) connects to the server, asks for the tool catalog, and dispatches the tools the model decides to call. 
The transport is either stdio (parent-child process) or HTTP/SSE (for remote servers). The benefit: the same MCP server you build for your bot is reusable from Claude Desktop, the MCP Inspector , or any future MCP host. Write the integration once, use it from anywhere. Part 2: Prerequisites Before you start, make sure you have the following installed and accessible: Requirement Notes Docker + Docker Compose Tested on Docker Desktop (macOS) and Docker Engine (Linux). An Ollama instance Reachable from the container. Default model qwen3-coder:latest needs ~16 GB VRAM. A Trello account Free tier works. You will create an API key and a token. A Telegram account Free. You will create a bot and find your numeric user id. A code editor Any. You will edit one .env file. Hardware note: Ollama can run on CPU but is too slow for an interactive chat experience. A GPU with at least 16 GB VRAM is recommended for the default model. If you only have 8 GB, swap to a smaller tool-calling model such as llama3.1:8b . Part 3: Get your Trello credentials You need two strings from Trello: an API key and a token . 3.1 Create a Power-Up to get an API key Open https://trello.com/power-ups/admin in a browser (logged in to Trello). Click New to create a Power-Up. Fill in any name and workspace. You are not actually shipping a Power-Up; you only need the credentials it generates. After creation, click the Power-Up, then open the API key tab. Click Generate a new API key . Copy the value and save it as TRELLO_API_KEY . 3.2 Generate a token On the same API key tab, find the description text on the right that contains a blue Token link. Click it. Trello will ask you to authorize the Power-Up against your account. Click Allow . Trello returns a long string. Copy it and save it as TRELLO_API_TOKEN . Common mistake: the Secret on the API key tab is not the token. The token is what you get after clicking the blue Token link and authorizing. 
Using the Secret instead of the Token is the most common cause of 401 Unauthorized errors later. Part 4: Create your Telegram bot 4.1 Talk to BotFather In Telegram, search for @BotFather and open a chat. Send /newbot . Answer the prompts: a display name (anything) and a username ending in bot (must be globally unique). BotFather replies with a token that looks like 123456:ABC-DEF... . Save it as TELEGRAM_BOT_TOKEN . The exchange looks roughly like this: You /newbot BotFather Alright, a new bot. How are we going to call it? Please choose a name for your bot. You My Trello Bot BotFather Good. Now let's choose a username for your bot. It must end in `bot`. You my_trello_bot BotFather Done! Congratulations on your new bot. Use this token to access the HTTP API: 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11 Keep your token secure... Keep this token private. Anyone with it can impersonate your bot. 4.2 Find your numeric Telegram user id The bot uses your numeric Telegram id (not your @handle ) for authorization. In Telegram, search for @userinfobot . Send any message. It replies with your numeric id (something like 987654321 ). Save it as TELEGRAM_ALLOWED_USER_IDS . If you want to allow multiple users, list ids comma-separated: 123,456,789 . Part 5: Set up Ollama Ollama runs the LLM. You can host it on the same machine as the bot, or on a separate GPU box on your LAN. 5.1 Install Ollama Follow the install instructions at https://ollama.com . On macOS: brew install ollama ollama serve On Linux: curl -fsSL https://ollama.com/install.sh | sh 5.2 Pull a tool-calling model The default model used by this bot is qwen3-coder:latest . Pull it: ollama pull qwen3-coder:latest You should see something like this once it finishes: pulling manifest pulling 0b8c4f5e7e9a... 100% ▕████████████████▏ 18 GB pulling 9f2c8a... 100% ▕████████████████▏ 12 KB pulling 7d6f1a... 
100% ▕████████████████▏ 1.4 KB verifying sha256 digest writing manifest success Tested models that work: qwen3-coder:latest (~16 GB VRAM, recommended) qwen-pro:latest llama3.1:8b (works on smaller GPUs) Avoid: Gemma family models. Tool-calling reliability across a 67-tool surface is too low for an agent loop. 5.3 Confirm it is reachable If Ollama runs on the same machine as the bot, the default http://localhost:11434 works. If it runs on a different machine on your LAN, find its IP and confirm: curl http://&lt;ollama-ip&gt;:11434/api/tags You should see a JSON list of installed models. Save the URL as OLLAMA_HOST for later. Part 6: Clone, configure, and run You now have all four secrets and a working Ollama. Time to start the bot. 6.1 Clone the repository git clone https://github.com/devdaviddr/trello-mcp-service.git cd trello-mcp-service 6.2 Configure your environment Copy the example file and fill in your values: cp .env.example .env $EDITOR .env The minimum you must set: TRELLO_API_KEY = ... TRELLO_API_TOKEN = ... TELEGRAM_BOT_TOKEN = ... TELEGRAM_ALLOWED_USER_IDS = ... OLLAMA_HOST = http://host.docker.internal:11434 # if Ollama is on the host OLLAMA_MODEL = qwen3-coder:latest OLLAMA_HOST from inside Docker: Same machine, macOS/Windows: http://host.docker.internal:11434 Same machine, Linux: http://host.docker.internal:11434 (the included extra_hosts config makes this work) Different machine on LAN: http://&lt;lan-ip&gt;:11434 6.3 Start the container docker compose up --build The first build takes a minute or two. 
Once running you should see logs like: trello-bot | [mcp-server] connecting trello client trello-bot | [mcp-server] registered 67 tools trello-bot | [bot] ollama host: http://host.docker.internal:11434 trello-bot | [bot] model: qwen3-coder:latest trello-bot | [bot] starting long-poll... 6.4 Test it Open Telegram, find your bot by the username you gave BotFather, and send /start . The bot will greet you back. Now try a real query: what boards do I have? The first reply will take 20–60 seconds while Ollama loads model weights into VRAM. Subsequent replies should land in 1–3 seconds. Built-in commands: /start : greeting /reset : clear this chat's conversation history /whoami : show your Telegram numeric id and whether you are authorized (use this if the bot replies \"Not authorized\") Part 7: How it works under the hood Now that the bot is running, this section explains the implementation so you can extend or fork it. 7.1 Defining a tool Each Trello operation is registered as one MCP tool. The project uses zod for schemas. One definition gives compile-time types and runtime validation, and converts cleanly to JSON Schema for the LLM. def ( \"create_card\" , \"Create a new card in a list.\" , z. object ({ list_id: z. string (), name: z. string (), description: z. string (). optional (), due: z. string (). optional (). describe ( \"ISO 8601 due date\" ), }), async ( args , trello ) => trello.cards. create (args.list_id, args.name, args.description, args.due), ); The handler delegates to a thin Trello REST client. 
zod parses the LLM's arguments at runtime, so if the model hallucinates a field type or omits a required arg, the call is rejected with a readable error string. That error becomes the next role: \"tool\" message, and the model uses it to fix its mistake on the next turn. This pattern is repeated 67 times, one tool per Trello capability. [Diagram: one zod schema, z.object({ list_id, name, ... }), feeds three things: TypeScript types (compile time), JSON Schema (advertised to the LLM), and zod.parse(args), which catches LLM hallucinations. When parsing succeeds, handler(args, trello) calls the Trello REST API; when it fails, the error string becomes a role: tool message and the model retries next turn.] 7.2 Running the MCP server over stdio The MCP server is a small glue file: import { Server } from \"@modelcontextprotocol/sdk/server/index.js\" ; import { StdioServerTransport } from \"@modelcontextprotocol/sdk/server/stdio.js\" ; import { CallToolRequestSchema , ListToolsRequestSchema } from \"@modelcontextprotocol/sdk/types.js\" ; const server = new Server ( { name: \"trello-mcp\" , version: \"0.1.0\" }, { capabilities: { tools: {} } }, ); server. setRequestHandler (ListToolsRequestSchema, async () => ({ tools: toolSchemas })); server. setRequestHandler (CallToolRequestSchema, async ( req ) => { const tool = toolsByName. get (req.params.name); if ( ! tool) { return { isError: true , content: [{ type: \"text\" , text: `Unknown tool: ${ req . params . name }` }] }; } try { const result = await tool. handler (req.params.arguments ?? {}, trello); return { content: [{ type: \"text\" , text: JSON . 
stringify (result ?? { ok: true }) }] }; } catch (err) { const message = err instanceof Error ? err.message : String (err); return { isError: true , content: [{ type: \"text\" , text: message }] }; } }); await server. connect ( new StdioServerTransport ()); stdio means the server runs as a subprocess of whoever launches it. No port to expose, no auth layer to manage, no network round trip on each tool call. The same binary works standalone with Claude Desktop pointed at it, covered in Part 8. 7.3 The Trello REST client with retries Trello rate-limits at 100 requests per 10 seconds per token. A naïve fetch will fail on the first 429. The request layer in this project retries with jittered exponential backoff and honors Retry-After when Trello provides it. async request<T>(method: string, path: string, params: QueryParams = {}): Promise<T> { const url = `${ BASE }${ path }?${ this . 
auth ( params ) }` ; let lastBody = \"\" ; let lastStatus = 0 ; for ( let attempt = 1 ; attempt <= MAX_ATTEMPTS ; attempt ++) { const res = await fetch (url, { method }); if (res.ok) { const text = await res. text (); return text ? ( JSON . parse (text) as T ) : ( undefined as T ); } lastStatus = res.status; lastBody = await res. text (); if ( ! RETRY_STATUSES . has (res.status) || attempt === MAX_ATTEMPTS ) break ; const retryAfter = Number (res.headers. get ( \"retry-after\" )); const backoff = Number. isFinite (retryAfter) && retryAfter > 0 ? retryAfter * 1000 : Math. min ( 8000 , 500 * 2 ** (attempt - 1 )) + Math. random () * 250 ; await sleep (backoff); } throw new Error ( `Trello ${ method } ${ path } failed: ${ lastStatus } ${ lastBody }` ); } Settings: RETRY_STATUSES is {429, 502, 503, 504} . Up to 4 attempts. The final error includes the status and response body, so failures are debuggable from logs. This single function carries every Trello call in the codebase. 
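The delay rule inside that loop can be pulled out into a pure function, which makes it easy to check by hand. This is a minimal sketch rather than the project's code: the name backoffMs and the injectable random parameter are assumptions added for illustration, while the constants mirror the snippet above.

```typescript
// Hypothetical helper (not from the project): computes the wait before the next retry.
// attempt is 1-based; retryAfterHeader is the raw Retry-After value, or null if absent.
function backoffMs(
  attempt: number,
  retryAfterHeader: string | null,
  random: () => number = Math.random,
): number {
  // Number(null) is 0 and Number('some date string') is NaN, so both fall through to backoff.
  const retryAfter = Number(retryAfterHeader);
  if (Number.isFinite(retryAfter) && retryAfter > 0) {
    return retryAfter * 1000; // the server said how long to wait, in seconds
  }
  // 500, 1000, 2000, ... doubling per attempt and capped at 8000 ms.
  const exponential = Math.min(8000, 500 * 2 ** (attempt - 1));
  // Up to 250 ms of jitter spreads out clients that failed at the same moment.
  const jitter = random() * 250;
  return exponential + jitter;
}

// With jitter pinned to 0, the first three waits are 500, 1000 and 2000 ms.
const waits = [1, 2, 3].map((attempt) => backoffMs(attempt, null, () => 0));
console.log(waits);
```

One consequence of parsing the header with Number: a Retry-After sent as an HTTP date rather than a number of seconds is ignored, and the exponential schedule is used instead.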
7.4 The agent loop The Ollama npm package speaks the tool-calling API directly, so the loop is short: for ( let turn = 0 ; turn < MAX_TURNS ; turn ++ ) { const res = await ollama. chat ({ model, messages, tools, stream: false }); const msg = res.message; messages. push (msg); const calls = msg.tool_calls ?? []; if (calls. length === 0 ) return { reply: msg.content ?? \"\" }; for ( const call of calls) { let toolResult : string ; try { toolResult = await mcp. callTool (call.function.name, normalizeArgs (call.function.arguments)); } catch (err) { toolResult = `ERROR: ${ err instanceof Error ? err . message : String ( err ) }` ; } messages. push ({ role: \"tool\" , content: truncate (toolResult), tool_name: call.function.name }); } } What it does: If tool_calls is empty, the model has produced its final answer and the loop returns. Otherwise it dispatches each call to the MCP server and pushes the result back as a role: \"tool\" message. Errors are included; that is how the model recovers. MAX_TURNS defaults to 16 so a confused model cannot spin forever. Tool output is truncated to a 16,000-character budget ( TOOL_OUTPUT_CHAR_BUDGET ) before entering history, so a large list_boards does not blow past the context window. 7.5 Telegram wiring The Telegram side, using grammy : const bot = new Bot (token); bot. on ( \"message:text\" , async ( ctx ) => { if ( ! 
isAuthorized (ctx.from?.id)) return ctx. reply ( \"Not authorized\" ); await chatQueue. run (ctx.chat.id, async () => { const history = historyStore. get (ctx.chat.id); const { reply , history : next } = await agent. chat (history, ctx.message.text); historyStore. set (ctx.chat.id, next); await ctx. reply (reply); }); }); await bot. start (); Two non-obvious details, learned the hard way. First, chatQueue serializes messages per chat. If two messages arrive in the same chat before the first finishes, both handlers would read the same starting history, and the second one's set() would clobber the first. A small Promise-queue keyed by chat id prevents this. Second, history trim must land on a user-message boundary. Tool-calling APIs require an assistant message with tool_calls to be immediately followed by role: \"tool\" messages for each call. A naïve slice(-40) can leave an orphan tool result, and the next API call rejects it. The project's trim walks the cut point forward until it lands on role: \"user\" . Part 8: Use the MCP server standalone The MCP server is independent of the bot. You can plug it into any MCP host. 
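Before the host-specific setup, the history-trim rule from 7.5 is easier to see in code. This is a minimal sketch, not the project's implementation: trimHistory and the simplified Msg type are illustrative names, and it assumes the system prompt is stored separately from the trimmed history.

```typescript
// Simplified message shape for illustration; real Ollama messages carry more fields.
type Msg = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string };

// Keep roughly the last maxMessages entries, but never start on an orphaned tool
// result or a half-finished assistant turn: walk the cut forward to a user message.
function trimHistory(history: Msg[], maxMessages: number): Msg[] {
  if (history.length <= maxMessages) return history;
  let cut = history.length - maxMessages;
  while (cut < history.length && history[cut].role !== 'user') cut++;
  return history.slice(cut); // empty if no user message remains after the cut
}

const history: Msg[] = [
  { role: 'user', content: 'what boards do I have?' },
  { role: 'assistant', content: '' }, // imagine this turn carried tool_calls
  { role: 'tool', content: '[boards...]' },
  { role: 'assistant', content: 'You have two boards.' },
  { role: 'user', content: 'add a card to Roadmap' },
  { role: 'assistant', content: 'Done.' },
];

// A naive slice(-4) would start on the tool result at index 2.
// The trim skips ahead and starts on the second user message instead.
console.log(trimHistory(history, 4).map((m) => m.role));
```

The trade-off is that the trimmed history can be shorter than the requested budget, which is usually preferable to sending a sequence the API rejects.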
8.1 With Claude Desktop Add this to claude_desktop_config.json : { \"mcpServers\" : { \"trello\" : { \"command\" : \"node\" , \"args\" : [ \"/absolute/path/to/trello-mcp-service/dist/mcp-server/index.js\" ], \"env\" : { \"TRELLO_API_KEY\" : \"...\" , \"TRELLO_API_TOKEN\" : \"...\" } } } } Restart Claude Desktop. All 67 Trello tools become available in any conversation. Ask \"create a card on Roadmap called Buy milk\" and Claude will discover create_card , fill the arguments, and return the result inline as a tool-use turn. The same goes for the read-side tools: \"what's on my board?\" produces a list_boards + list_cards_on_board chain without any extra prompting. 8.2 With the MCP Inspector npx @modelcontextprotocol/inspector node dist/mcp-server/index.js The Inspector opens a browser UI where you can browse the tool catalog, read schemas, and call tools manually. It is the fastest way to verify tool behavior without involving an LLM, and the right place to debug a failing tool before you suspect the model. Part 9: Customizing and extending 9.1 Add a new Trello tool Open src/mcp-server/tools/ and pick the file matching the resource (e.g. cards.ts ). Add a new def(...) registration with a name, description, zod schema, and async handler. Rebuild the container: docker compose up --build . The new tool is picked up automatically. There is no separate registration step. 9.2 Swap to a different SaaS API The project is a clean reference for any REST-backed SaaS. To fork it: Replace src/mcp-server/trello/ with a client for your target API (Linear, GitHub Issues, Notion, etc.). Replace the tool registrations under src/mcp-server/tools/ with your new operations. 
Everything else stays the same: the agent loop, Telegram wiring, history management, and Docker setup. The whole codebase is roughly 1,900 lines of TypeScript across 35 files. 9.3 Tunable knobs All behavior is env-var driven. Useful ones: Var Default Purpose MAX_TURNS 16 Max chained tool calls per user message. TOOL_OUTPUT_CHAR_BUDGET 16000 Tool output truncation before entering history. OLLAMA_TIMEOUT_MS 120000 Per-call abort timeout for Ollama. OLLAMA_MODEL qwen3-coder:latest Any tool-calling-capable model. Part 10: Troubleshooting Symptom Cause and fix Bot starts but never replies, no errors in logs On Apple Silicon, node:20-alpine runs under Rosetta and Node's TLS hangs on api.telegram.org . The project uses node:20-slim to avoid this. If you forked back to alpine, switch back. Not authorized reply in Telegram TELEGRAM_ALLOWED_USER_IDS must contain your numeric id, not your @handle . Send /whoami to the bot to see what id Telegram reports for you. 401 Unauthorized from Trello The Secret on the Power-Up API key page is not the Token. Click the blue Token link, authorize, and use that string. I hit my tool-call limit A multi-step request exceeded MAX_TURNS=16 . Bump it via env or break the request into smaller asks. Frequent hits often mean the model is looping; try a stronger model. First reply takes 20–60 seconds Ollama cold-loads the model into VRAM on the first request. Subsequent calls are normal-speed. Pre-warm with a curl to /api/generate if you want the first user-facing reply to be fast. Bot can reach internet but not Ollama If Ollama runs on the host, set OLLAMA_HOST=http://host.docker.internal:11434 . The included docker-compose.yml has the extra_hosts mapping needed for Linux. Result End-to-end on a warm model: roughly 1.5 seconds per reply. Cold start: 20–60 seconds for the first turn while Ollama loads weights into VRAM. The MCP server exposes 67 tools, from create_card to list_cards_due_soon to set_card_cover . 
Because it speaks plain MCP, plugging it into Claude Desktop is a four-line config addition. Forking it for a different SaaS is roughly two evenings of work for a comparable surface. Source Full source, README, architecture diagram, and the complete 67-tool inventory: github.com/devdaviddr/trello-mcp-service ."
  },
  {
    "slug": "2026-05-10-multiple-apps-docker-networks-cloudflare-tunnels",
    "title": "Running Multiple Apps with Traefik, Docker, and Cloudflare Tunnels",
    "description": "How to host five apps on one Mac Mini without cramming them into a single compose file. One shared network, Traefik for routing and TLS, one Cloudflare tunnel.",
    "tags": [
      "self-hosting",
      "docker",
      "traefik",
      "cloudflare",
      "infrastructure"
    ],
    "excerpt": "After the Mac Mini setup from the last post, I got greedy. The M4 was idling at 15% CPU with 8 GB of RAM untouched, the Cloudflare tunnel was already wired up, and the marginal cost of another app was effectively zero. So I started adding more. Five ",
    "content": "After the Mac Mini setup from the last post, I got greedy. The M4 was idling at 15% CPU with 8 GB of RAM untouched, the Cloudflare tunnel was already wired up, and the marginal cost of another app was effectively zero. So I started adding more. Five apps later, the lesson is straightforward: you don't need to cram them into one giant compose file. Each app gets its own directory, its own compose file, its own database. They share one Docker network, one reverse proxy (Traefik), and one Cloudflare tunnel. Routing and TLS are declared with Docker labels on each app, so there's no central config file to keep in sync. That's the whole pattern. Don't do this The temptation, when you start, is to put everything in a single docker-compose.yml : services : app1-frontend : ... app1-backend : ... app1-db : ... app2-frontend : ... app2-backend : ... app2-db : ... It works until it doesn't. One service crashes and you restart the whole stack. You update one app and risk breaking the other three. Ports collide, dependencies tangle, the file balloons. The blast radius of any change is the entire system. The pattern [Diagram: Mac Mini M4 → Cloudflare Tunnel → Traefik (proxy + TLS) → App1, App2, App3, App4, ... (each its own stack)] One shared web Docker network connects every app to Traefik. Traefik watches the Docker socket and discovers routes from labels on each container, so there's no central routing file. Cloudflare hands inbound traffic to Traefik, which decides which app gets it. Each app's database stays on its own internal network, unreachable from anywhere else. 
File layout: ~/apps/ ├── portfolio/ │ ├── docker-compose.yml │ ├── frontend/ │ ├── backend/ │ └── .env ├── api-project/ │ ├── docker-compose.yml │ ├── api/ │ └── .env └── shared/ └── traefik/ ├── docker-compose.yml └── .env Step 1: the shared network docker network create web That's the entire setup step. The network persists across container restarts and reboots. Every app's compose file references it as external. Step 2: an app A typical app compose file. Note the labels on each service Traefik should expose; the database has none, so Traefik never sees it. services : frontend : build : ./frontend container_name : portfolio-frontend networks : [ web , portfolio-internal ] labels : - \"traefik.enable=true\" - \"traefik.http.routers.portfolio.rule=Host(`portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio.entrypoints=web\" - \"traefik.http.services.portfolio.loadbalancer.server.port=80\" restart : unless-stopped backend : build : ./backend container_name : portfolio-backend env_file : .env networks : [ web , portfolio-internal ] labels : - \"traefik.enable=true\" - \"traefik.http.routers.portfolio-api.rule=Host(`api.portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio-api.entrypoints=web\" - \"traefik.http.services.portfolio-api.loadbalancer.server.port=4000\" restart : unless-stopped db : image : postgres:16-alpine container_name : portfolio-db env_file : .env volumes : - portfolio-db:/var/lib/postgresql/data networks : [ portfolio-internal ] restart : unless-stopped networks : web : external : true portfolio-internal : volumes : portfolio-db : The two-networks-per-service trick is what makes this clean. The frontend and backend join web so Traefik can reach them. The database stays on portfolio-internal only, so nothing outside the app can talk to it. Each app is an island with one bridge to the proxy. Bring it up: cd ~/apps/portfolio && docker compose up -d A second app is the same pattern. New directory, its own compose file, its own internal network name, its own router labels. Repeat as needed. Step 3: Traefik ~/apps/shared/traefik/docker-compose.yml : services : traefik : image : traefik:v3.1 container_name : traefik command : - \"--providers.docker=true\" - \"--providers.docker.exposedbydefault=false\" - \"--providers.docker.network=web\" - \"--entrypoints.web.address=:80\" ports : - \"80:80\" volumes : - /var/run/docker.sock:/var/run/docker.sock:ro networks : [ web ] restart : unless-stopped networks : web : external : true A few things to notice. 
exposedbydefault=false means Traefik ignores any container that doesn't explicitly set traefik.enable=true , so a stray container can't accidentally appear on the public internet. The Docker socket is mounted read-only because Traefik only needs to watch it. And there's no central routing config file at all; Traefik builds the router table from labels on running containers and updates it live as you start and stop services. Bring it up: cd ~/apps/shared/traefik && docker compose up -d At this point, the apps are reachable on localhost:80 with the right Host header. Time to give them real TLS. Step 4: real SSL via Cloudflare DNS Cloudflare's edge already terminates TLS for users with their own certificate, and the tunnel from edge to your Mac is encrypted. So strictly, Traefik doesn't need its own certs to be safe. There's still a good case for issuing real Let's Encrypt certs at the origin: it gives you defense in depth (the cloudflared-to-Traefik hop is also TLS), it lets you flip Cloudflare into \"Full (strict)\" SSL mode, and it future-proofs the setup if you ever expose a service outside the tunnel. The wrinkle when you're behind a tunnel is that the HTTP-01 challenge can't reach you. Cloudflare DNS points at the tunnel, not at your home IP, and there's no public port for Let's Encrypt to hit. The DNS-01 challenge works fine, though: it just writes a TXT record. Traefik supports DNS-01 against the Cloudflare API natively. First, create a scoped API token. In the Cloudflare dashboard: My Profile → API Tokens → Create Token → \"Edit zone DNS\" template . Scope it to the specific zone(s) you're using. 
Save the token in ~/apps/shared/traefik/.env : CF_DNS_API_TOKEN=<the token> Then update the Traefik compose file to add the HTTPS entrypoint, the ACME resolver, and a place to persist certs: services : traefik : image : traefik:v3.1 container_name : traefik command : - \"--providers.docker=true\" - \"--providers.docker.exposedbydefault=false\" - \"--providers.docker.network=web\" - \"--entrypoints.web.address=:80\" - \"--entrypoints.websecure.address=:443\" - \"--certificatesresolvers.cloudflare.acme.dnschallenge=true\" - \"--certificatesresolvers.cloudflare.acme.dnschallenge.provider=cloudflare\" - \"--certificatesresolvers.cloudflare.acme.email=you@yourdomain.com\" - \"--certificatesresolvers.cloudflare.acme.storage=/letsencrypt/acme.json\" env_file : .env ports : - \"80:80\" - \"443:443\" volumes : - /var/run/docker.sock:/var/run/docker.sock:ro - letsencrypt:/letsencrypt networks : [ web ] restart : unless-stopped networks : web : external : true volumes : letsencrypt : Then update each exposed service in the app compose files to use the websecure entrypoint and the Cloudflare resolver: labels : - \"traefik.enable=true\" - \"traefik.http.routers.portfolio.rule=Host(`portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio.entrypoints=websecure\" - \"traefik.http.routers.portfolio.tls.certresolver=cloudflare\" - \"traefik.http.services.portfolio.loadbalancer.server.port=80\" The first time Traefik sees a router with tls.certresolver=cloudflare , it asks Let's Encrypt for a cert, Let's Encrypt asks for a TXT record under _acme-challenge.portfolio.yourdomain.com , Traefik writes it via the Cloudflare API, Let's Encrypt verifies, and the cert lands in /letsencrypt/acme.json . Renewals happen on their own. You don't think about it again. Restart the proxy so it picks up the new args: cd ~/apps/shared/traefik && docker compose up -d Watch the logs the first time; cert issuance takes a few seconds and you'll see Traefik report success. Step 5: the tunnel Now point the tunnel at HTTPS on Traefik. 
~/.cloudflared/config.yml : tunnel : <your-tunnel-id> credentials-file : /Users/<username>/.cloudflared/<tunnel-id>.json ingress : - hostname : portfolio.yourdomain.com service : https://localhost:443 - hostname : api.portfolio.yourdomain.com service : https://localhost:443 - hostname : api.yourdomain.com service : https://localhost:443 - service : http_status:404 Every hostname goes to https://localhost:443 . That's Traefik. Traefik reads the Host header and forwards to the right container on the web network. Cloudflare doesn't need to know about your app topology; Traefik owns that. Add the DNS records and reload the tunnel: cloudflared tunnel route dns <tunnel-id> portfolio.yourdomain.com cloudflared tunnel route dns <tunnel-id> api.portfolio.yourdomain.com cloudflared tunnel route dns <tunnel-id> api.yourdomain.com sudo launchctl kickstart -k system/com.cloudflare.cloudflared In the Cloudflare dashboard, you can now switch SSL/TLS mode for the zone to Full (strict) . Edge-to-origin traffic is verified end to end. 
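The split of responsibilities in this step, where the tunnel sends everything to one address and the proxy picks the app, boils down to a lookup keyed on the Host header. The sketch below is only a mental model in TypeScript and says nothing about how Traefik is implemented; the function name upstreamFor is made up, while the hostnames, container names, and ports are the ones from the compose labels earlier.

```typescript
// Mental model only: a routing table like the one Traefik derives from container labels.
const routes = new Map<string, string>([
  ['portfolio.yourdomain.com', 'http://portfolio-frontend:80'],
  ['api.portfolio.yourdomain.com', 'http://portfolio-backend:4000'],
]);

// Returns the upstream for a Host header, or null when no router matches.
function upstreamFor(hostHeader: string): string | null {
  // Hostnames are case-insensitive, and a client may append a :port suffix.
  const host = hostHeader.toLowerCase().split(':')[0];
  return routes.get(host) ?? null;
}

console.log(upstreamFor('portfolio.yourdomain.com'));
console.log(upstreamFor('unknown.yourdomain.com'));
```

Because the table lives in the proxy, adding an app changes labels on that app only; the tunnel entry is one more hostname pointing at the same https://localhost:443 address.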
How a request flows A request to portfolio.yourdomain.com lands at Cloudflare's edge, gets TLS terminated against Cloudflare's cert, travels down the encrypted outbound tunnel to https://localhost:443 on the Mac, hits Traefik (which presents the Let's Encrypt cert it issued for that hostname), and Traefik forwards the decrypted request to the portfolio-frontend container on the web network. The response retraces the path. From the app's point of view, it's just receiving a plain HTTP request from a sibling container. Day-to-day Each app is independent: # update portfolio cd ~/apps/portfolio && docker compose pull && docker compose up -d # tail one app's logs cd ~/apps/api-project && docker compose logs -f # restart one service cd ~/apps/portfolio && docker compose restart backend # see what Traefik thinks is routable docker logs traefik | grep -i router Adding a new app: create the directory, write a compose file that joins web and declares its router labels, docker compose up -d . Traefik picks it up within a second or two and (if you set tls.certresolver=cloudflare ) issues a cert on the spot. Add the tunnel route. Five minutes start to finish. Resource usage What's currently running on the base Mac Mini M4: Portfolio site (React + Express + Postgres) Side project API (Node + Redis + Postgres) Personal dashboard (Vue + SQLite) Internal tool (Python FastAPI, no database) A staging environment that comes and goes Steady state: 15 to 20% CPU, around 8 of the 16 GB of RAM in use, 45 of the 256 GB SSD. Plenty of headroom. 
When one big compose file is still right Use a single compose file when the services are one logical app and ship together (a frontend pinned to a specific backend version, for example). Use multiple compose files when the apps are independent projects with their own update schedules. For self-hosting a portfolio of side projects, multiple compose files win every time. A few things worth knowing Router names must be unique across the whole proxy. Traefik discovers routers by label, not by container, so two services labeled traefik.http.routers.api... will fight. Prefix with the app name ( portfolio-api , dashboard-api , etc.). Lock down exposedbydefault=false . Without it, every container on the web network gets auto-published. With it, only services that explicitly set traefik.enable=true are exposed. Treat this as non-negotiable. One .env per app. No shared secrets across projects. If one leaks, the blast radius is exactly one app. The Cloudflare API token lives in the Traefik directory only. Watch disk usage. Multiple databases and a growing collection of images add up faster than you expect. docker system df once a week, docker system prune -a once a month. Stagger backups. Five Postgres dumps at midnight is a real IO spike. Spread them across the early-morning hours. Persist acme.json . It holds your certs and account key. The letsencrypt named volume in the compose file above does this; if you blow it away, Traefik re-issues everything from scratch and you can hit Let's Encrypt rate limits. Closing The point of this setup isn't density. It's that the second app costs nothing in operational complexity, and the third costs even less. Traefik handles routing and TLS without a config file you have to remember to update. You stop thinking about whether a project is \"worth\" hosting. You build it, drop it in ~/apps/ , label the router, and move on. The tunnel doesn't care, the proxy doesn't care, and the Mac Mini definitely doesn't care."
  },
  {
    "slug": "2026-05-10-self-hosting-mac-mini-cloudflare-tunnels",
    "title": "Self-Hosting a Full-Stack App on a Mac Mini M4 with Cloudflare Tunnels",
    "description": "How I moved a React + Express + Postgres app off an $11/mo VPS onto a $599 Mac Mini, with Cloudflare Tunnels handling the public-facing parts.",
    "tags": [
      "self-hosting",
      "infrastructure",
      "cloudflare",
      "docker"
    ],
    "excerpt": "I was paying $11 a month for a VPS to host what amounted to a small React frontend, an Express API, and a Postgres database. $132 a year isn't ruinous, but the value side was thin: shared CPU cores that throttle under load, 1 GB of RAM that fills up ",
    "content": "I was paying $11 a month for a VPS to host what amounted to a small React frontend, an Express API, and a Postgres database. $132 a year isn't ruinous, but the value side was thin: shared CPU cores that throttle under load, 1 GB of RAM that fills up fast, 25 GB of disk that fills up faster. A side project pulling 100 visitors a day shouldn't need any of that babysitting. Last month I bought a base Mac Mini M4 for $599. I moved the whole stack onto it and put it on the public internet through Cloudflare Tunnels. No port forwarding, no exposed home IP, no firewall rules to maintain. It runs me about $2 a month in electricity. Here's how it's wired up. Why the Mac Mini holds up The base Mac Mini M4 is a real server. 10-core CPU, 16 GB of unified memory, 256 GB SSD. It idles around 10 watts and barely touches 30 under load. Compare that to an $11 VPS, where your &quot;1 vCPU&quot; is a fraction of someone else's processor and gets throttled the moment a noisy neighbor needs it. Apple Silicon does the heavy lifting. The M4 runs Docker containers cool and quiet. With my Vite build watcher, the Express API, Postgres, and Redis all running, Activity Monitor barely registers a blip. No fan ramps. No thermal throttling. And you own it. No surprise tier increases, no deprecation emails, no terms-of-service rewrites pointed at your project. The pricing path VPS providers solve real problems. The pricing path is the issue. You start at $5, bump to $11 when your app needs more RAM, add $2 for backups, then $11 for a staging environment, then more disk. Every step costs more for resources that are still shared and still constrained. For a side project, the math rarely works in your favor. You don't need someone else's slice of a server. You need your app to run reliably and cheaply. A Mac Mini does that. 
The architecture Internet Users | v ┌────────────────┐ │ Cloudflare │ │ Edge Network │ │ (SSL, DDoS) │ └────────┬───────┘ | Encrypted Tunnel (outbound) | v ┌────────────────┐ │ Mac Mini M4 │ │ (Your Home) │ └────────┬───────┘ | ┌───────────┴───────────┐ | | ┌────v─────┐ ┌─────v────┐ │ nginx │ │ Express │ │ (React) │ │ API │ │ :3000 │ │ :4000 │ └──────────┘ └─────┬────┘ | ┌─────v─────┐ │PostgreSQL │ │ Database │ └───────────┘ All in Docker containers Cloudflare Tunnels is what makes this safe. The Mac Mini opens an outbound connection to Cloudflare's edge. All inbound traffic flows through Cloudflare, picks up SSL and DDoS protection on the way, and is forwarded down the tunnel to the box. The Mac never accepts an inbound connection. There's nothing to port-forward, nothing to expose, no firewall hole to leave open. Docker setup Install Docker via Homebrew: brew install --cask docker Project layout: ~/my-app/ ├── docker-compose.yml ├── frontend/ # Vite-built React app (static dist/) ├── backend/ # Express API └── nginx.conf # static-file config for the frontend container The compose file ties it together: services : frontend : image : nginx:alpine volumes : - ./frontend/dist:/usr/share/nginx/html - ./nginx.conf:/etc/nginx/nginx.conf:ro ports : - \"3000:80\" restart : unless-stopped backend : build : ./backend env_file : .env ports : - \"4000:4000\" depends_on : - db restart : unless-stopped db : image : postgres:16-alpine env_file : .env volumes : - postgres_data:/var/lib/postgresql/data restart : unless-stopped volumes : postgres_data : A couple of notes on this. The old version: '3.8' field is gone; Compose v2 ignores it. Secrets live in a .env file, not in the compose file itself. Three containers, around 500 MB of memory between them. The 16 GB Mini handles this with plenty of headroom for whatever you want to run alongside it. A minimal Express API: const express = require ( 'express' ); const { Pool } = require ( 'pg' ); const app = express (); const pool = new Pool ({ connectionString: process.env. DATABASE_URL }); app. use (express. json ()); app. get ( '/api/health' , ( _req , res ) => res. json ({ status: 'ok' })); app. get ( '/api/data' , async ( _req , res ) => { const { rows } = await pool. query ( 'SELECT * FROM items' ); res. json (rows); }); app. listen ( 4000 , () => console. log ( 'API on :4000' )); The React frontend is a standard Vite build. Nothing exotic. Bring it up: docker compose up -d Frontend on :3000 , API on :4000 . Local only, for now. Wiring up Cloudflare Tunnels cloudflared is a small daemon that runs on the Mac and holds an outbound connection to Cloudflare. No DNS gymnastics. No certificate renewal. No firewall rules. brew install cloudflare/cloudflare/cloudflared cloudflared tunnel login cloudflared tunnel create my-app The login step opens a browser to pick a domain you've added to Cloudflare. No domain? 
Cloudflare will hand out a free *.trycloudflare.com subdomain for testing. Create ~/.cloudflared/config.yml : tunnel : <your-tunnel-id> credentials-file : /Users/<username>/.cloudflared/<tunnel-id>.json ingress : - hostname : myapp.yourdomain.com service : http://localhost:3000 - hostname : api.myapp.yourdomain.com service : http://localhost:4000 - service : http_status:404 Route DNS through the tunnel: cloudflared tunnel route dns my-app myapp.yourdomain.com cloudflared tunnel route dns my-app api.myapp.yourdomain.com Run it: cloudflared tunnel run my-app The app is now on the public internet behind Cloudflare's edge: HTTPS, DDoS protection, no exposed IP. To make it survive reboots: sudo cloudflared service install sudo launchctl start com.cloudflare.cloudflared The cost picture Mac Mini M4 (one-time): $599. Electricity at typical residential rates: roughly $2 a month. A domain is optional ($12 a year if you want one; the trycloudflare.com subdomain is free). A comparable VPS, once you add backups and a staging tier, lands around $200 to $300 a year. The Mini pays for itself in roughly two to three years. After that you're paying for power. And what you get for the same money is 10 CPU cores and 16 GB of memory instead of fractions of a shared core. Cloudflare Tunnels itself is on the free tier. Edge SSL, DDoS protection, and a global CDN at zero marginal cost. What you're trading off The Mac has to stay powered and online. 
A power blip or ISP outage takes you down. For side-project traffic that's usually fine, and a small UPS handles the brownouts. If you need four nines, this isn't the play. Back up Postgres. A nightly cron that dumps to an external drive (and optionally uploads to cheap object storage) is enough for most setups. Watch the disk. 256 GB is plenty until Docker images quietly accumulate. docker system prune -a once a month keeps it honest. Patch the system. brew upgrade cloudflared , pull fresh base images, restart the stack. Should be a 10-minute job, not a quarterly project. When this isn't the right call If your traffic is high, globally distributed, or genuinely needs HA, stay with the cloud. The Mac Mini is for the long tail: side projects, small businesses, internal tools, personal apps that have no business paying $20 a month for managed infrastructure. That's a lot of projects. Probably most of yours. What you get back, beyond the savings, is visibility. The box is on your desk. The logs are in your terminal. There are three containers, and you can see all of them. The whole stack fits in your head, which means you actually know what's running. Closing Self-hosting in 2026 isn't the chore it was a decade ago. The hardware is small, quiet, and cheap. The tools, Docker and Cloudflare Tunnels, hide the parts that used to be painful. You don't need a data center to run a real app. You need a Mac Mini and a tunnel."
  },
  {
    "slug": "2026-01-19-building-a-scalable-pdf-ai-analysis-pipeline",
    "title": "Building a Scalable PDF AI Analysis Pipeline with Python Microservices, Docker, Groq, and RabbitMQ",
    "description": "PDF analysis pipeline built with Python microservices, Docker, RabbitMQ, and Groq AI for scalable document processing and analysis.",
    "tags": [
      "ai",
      "python",
      "microservices",
      "docker",
      "rabbitmq",
      "groq",
      "pdf processing"
    ],
    "excerpt": "For developers and engineering teams, PDF documents represent a massive repository of critical information — technical specifications, research papers, financial reports, legal contracts, and customer submissions. However, PDFs are essentially locked",
    "content": "For developers and engineering teams, PDF documents represent a massive repository of critical information — technical specifications, research papers, financial reports, legal contracts, and customer submissions. However, PDFs are essentially locked boxes of data. Unlike structured databases or searchable codebases, you cannot query, aggregate, or analyze hundreds of PDFs simultaneously without manual effort. We face a fundamental bottleneck: document quantity versus extraction capacity. As organizations accumulate thousands of PDFs, the gap between having information and actually leveraging it grows exponentially. The Problem: The Document Processing Bottleneck Traditional approaches to PDF processing create immediate friction. The challenges are architectural: Synchronous Blocking — Users upload a document and wait while a single-threaded process extracts text, calls an AI API, and returns results. One slow PDF blocks everything behind it. Resource Mismatch — Text extraction is CPU-intensive. AI inference is network-bound. Storage operations are I/O-heavy. Running these on a single server wastes resources during each stage. Poor User Experience — Without async processing, users stare at loading spinners for minutes, unsure if the system crashed or is still working. The Solution: An Event-Driven Microservices Pipeline We are going to build a production-grade pipeline that decouples each processing stage into independent, scalable services. By leveraging Docker, RabbitMQ, Groq's inference API, and Streamlit, we will create a system that handles concurrent PDF uploads, processes them asynchronously, and delivers results through a polished web interface. The core innovation here is RabbitMQ message queuing. Rather than chaining services together synchronously, each service publishes events that downstream services consume. This pattern enables horizontal scaling, fault tolerance, and independent deployment cycles. 
Core Architecture The pipeline orchestrates six specialized microservices through an event-driven workflow: Streamlit UI — Users upload PDFs, select analysis types, and view real-time progress without page refreshes. API Gateway (FastAPI) — Accepts HTTP uploads, generates job IDs, and returns immediately while processing happens asynchronously. PDF Ingestion — Validates files, extracts metadata, stores PDFs in MinIO object storage, and publishes to the pdf.uploaded queue. Text Extractor — Consumes upload events, extracts text with PyPDF2/pdfplumber, handles OCR fallbacks, and publishes to the text.ready queue. AI Analyzer (Groq) — Consumes text events, sends content to Groq's Llama 3.1 or Mixtral models for summarization/classification/Q&A generation, and publishes to the analysis.done queue. Results Handler — Consumes analysis events, persists results to PostgreSQL, caches in Redis, and triggers webhooks for external integrations. Here is the complete architecture: ┌─────────────────────────────────────────────────────────────────────────────┐ │ PDF AI ANALYSIS PIPELINE ARCHITECTURE │ │ (Python Microservices + Docker + RabbitMQ + Groq) │ │ WITH STREAMLIT FRONTEND │ └─────────────────────────────────────────────────────────────────────────────┘ ┌──────────────────┐ │ STREAMLIT UI │ (Web Frontend) │ [Docker:8501] │ - File upload interface └────────┬─────────┘ - Real-time status dashboard │ - Results visualization │ HTTP POST/GET │ ▼ ┌─────────────────┐ │ API Gateway │ (FastAPI) │ [Docker:8000] │ - REST endpoints └────────┬────────┘ - Job management │ - SSE for real-time updates │ │ HTTP POST /analyze │ ▼ ┌────────────────────┐ │ PDF Ingestion │ (Python Service) │ Microservice │ - Validates PDF │ [Docker:8001] │ - Extracts metadata └─────────┬──────────┘ - Stores in MinIO │ │ Publish: pdf.uploaded │ ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ RABBITMQ MESSAGE BROKER │ │ [Docker:5672] │ │ ┌──────────────┐ ┌──────────────┐ 
┌──────────────┐ │ │ │ pdf.uploaded │ │ text.ready │ │analysis.done │ │ │ │ Queue │ │ Queue │ │ Queue │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ └─────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ │ ┌──────▼──────┐ ┌───────▼────────┐ ┌─────▼──────┐ │ PDF Text │ │ AI Analysis │ │ Results │ │ Extractor │ │ Microservice │ │ Handler │ │ Service │ │ (Groq) │ │ Service │ │[Docker:8002]│ │ [Docker:8003] │ │[Docker:8004]│ └──────┬──────┘ └────────┬───────┘ └─────┬──────┘ │ │ │ │ - PyPDF2 │ - Groq API │ - Store results │ - pdfplumber │ - Llama 3 │ - PostgreSQL │ - OCR │ - Summarization │ - Redis cache │ │ - Classification │ - Webhooks │ │ │ │ Publish: │ Publish: │ │ text.ready │ analysis.done │ │ │ │ └────────────────────┴──────────────────┘ │ ▼ ┌──────────────────┐ │ Data Storage │ │ │ │ - PostgreSQL │ [Docker:5432] │ - MinIO/S3 │ [Docker:9000] │ - Redis Cache │ [Docker:6379] └──────────────────┘ Message Flow User uploads PDF via Streamlit UI → API Gateway Ingestion service validates → publishes to pdf.uploaded queue Text Extractor consumes → extracts text → publishes to text.ready queue AI Analyzer consumes → calls Groq API → publishes to analysis.done queue Results Handler consumes → stores results → notifies user Streamlit polls API Gateway → displays real-time progress → shows results Part 1: The Infrastructure We will start with the foundational layer: orchestrating services with Docker Compose and designing a database schema that supports job tracking, result storage, and caching. Folder Structure Treat this as a monorepo. 
Create the following directory tree: mkdir pdf-ai-pipeline cd pdf-ai-pipeline mkdir services database mkdir services/streamlit-ui services/api-gateway services/pdf-ingestion mkdir services/text-extractor services/ai-analyzer services/results-handler touch docker-compose.yml .env database/init.sql The Docker Compose File We need to orchestrate ten services: Streamlit UI — the frontend users interact with. API Gateway (FastAPI) — the HTTP entry point for uploads and queries. PDF Ingestion, Text Extractor, AI Analyzer, Results Handler — the processing pipeline. RabbitMQ — message broker for event-driven communication. PostgreSQL — persistent storage for jobs and results. Redis — fast caching layer for frequently accessed results. MinIO — S3-compatible object storage for raw PDFs. 
Open docker-compose.yml and add this configuration: version : '3.8' services : # Frontend streamlit-ui : build : ./services/streamlit-ui container_name : pdf_ui ports : - \"8501:8501\" environment : - API_URL=http://api-gateway:8000 networks : - pdf-net depends_on : - api-gateway # API Gateway api-gateway : build : ./services/api-gateway container_name : pdf_api ports : - \"8000:8000\" environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - REDIS_URL=redis://redis:6379 - POSTGRES_URL=postgresql://admin:secretpassword@postgres:5432/pdf_analysis - MINIO_URL=minio:9000 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy postgres : condition : service_healthy redis : condition : service_started # PDF Ingestion (2 replicas for load balancing) pdf-ingestion : build : ./services/pdf-ingestion environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - MINIO_URL=minio:9000 - MINIO_ACCESS_KEY=minioadmin - MINIO_SECRET_KEY=minioadmin networks : - pdf-net depends_on : rabbitmq : condition : service_healthy minio : condition : service_started deploy : replicas : 2 # Text Extractor (3 replicas - CPU intensive) text-extractor : build : ./services/text-extractor environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy deploy : replicas : 3 # AI Analyzer (2 replicas) ai-analyzer : build : ./services/ai-analyzer environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - GROQ_API_KEY=${GROQ_API_KEY} networks : - pdf-net depends_on : rabbitmq : condition : service_healthy deploy : replicas : 2 # Results Handler results-handler : build : ./services/results-handler environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - POSTGRES_URL=postgresql://admin:secretpassword@postgres:5432/pdf_analysis - REDIS_URL=redis://redis:6379 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy postgres : condition : service_healthy redis : condition : service_started # RabbitMQ with Management UI rabbitmq : image : rabbitmq:3.12-management container_name : pdf_queue ports : - \"5672:5672\" - \"15672:15672\" environment : - RABBITMQ_DEFAULT_USER=guest - RABBITMQ_DEFAULT_PASS=guest networks : - pdf-net healthcheck : test : [ \"CMD\" , \"rabbitmq-diagnostics\" , \"-q\" , \"ping\" ] interval : 10s timeout : 5s retries : 5 # PostgreSQL for results storage postgres : image : postgres:15 container_name : pdf_db environment : - POSTGRES_DB=pdf_analysis - POSTGRES_USER=admin - POSTGRES_PASSWORD=secretpassword ports : - \"5432:5432\" volumes : - postgres_data:/var/lib/postgresql/data - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql networks : - pdf-net healthcheck : test : [ \"CMD-SHELL\" , \"pg_isready -U admin -d pdf_analysis\" ] interval : 10s timeout : 5s retries : 5 # Redis for caching redis : image : redis:7-alpine container_name : pdf_cache ports : - \"6379:6379\" networks : - pdf-net # MinIO (S3-compatible storage) minio : image : minio/minio container_name : pdf_storage ports : - \"9000:9000\" - \"9001:9001\" environment : - MINIO_ROOT_USER=minioadmin - MINIO_ROOT_PASSWORD=minioadmin command : server /data --console-address \":9001\" volumes : - minio_data:/data networks : - pdf-net volumes : postgres_data : minio_data : networks : pdf-net : driver : bridge Environment Variables Create a .env file to manage secrets: # Groq API Key (get one free at console.groq.com) GROQ_API_KEY=your_groq_api_key_here Designing the Schema We need to track job lifecycle, store analysis results, and cache frequently accessed data. Open database/init.sql : -- Jobs Table: Track processing lifecycle CREATE TABLE IF NOT EXISTS jobs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), filename TEXT NOT NULL, status TEXT DEFAULT 'pending', -- pending, extracting, analyzing, completed, failed analysis_type TEXT, -- summary, classification, qa_generation, full model TEXT, -- llama-3.1-70b, mixtral-8x7b created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); -- Results Table: Store analysis output CREATE TABLE IF NOT EXISTS results ( id SERIAL PRIMARY KEY, job_id UUID REFERENCES jobs(id) ON DELETE CASCADE, result_data JSONB NOT NULL, -- Flexible storage for any AI output confidence_score FLOAT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); -- Create indexes for fast queries CREATE INDEX idx_jobs_status ON jobs(status); CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC); CREATE INDEX idx_results_job_id ON results(job_id); -- GIN index for fast JSONB containment queries on results CREATE INDEX idx_results_data_gin ON 
results USING gin(result_data jsonb_path_ops); Booting Up Launch the infrastructure: # Start all services docker-compose up -d # Verify services are healthy docker-compose ps # View RabbitMQ Management UI at http://localhost:15672 (guest/guest) # View MinIO Console at http://localhost:9001 (minioadmin/minioadmin) Part 2: The Backend Services With infrastructure running, we will now build the four processing microservices that power the pipeline. Service 1: API Gateway (FastAPI) This service accepts HTTP uploads and immediately returns a job ID, enabling asynchronous processing. Create services/api-gateway/requirements.txt : fastapi uvicorn pika psycopg2-binary redis python-multipart python-dotenv The code ( services/api-gateway/main.py ): from fastapi import FastAPI, UploadFile, File, HTTPException from fastapi.responses import StreamingResponse import pika import psycopg2 import redis import json import os from uuid import uuid4 app = FastAPI( title = \"PDF Analysis API\" ) # Connect to Infrastructure RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) POSTGRES_URL = os.environ.get( \"POSTGRES_URL\" ) REDIS_URL = os.environ.get( \"REDIS_URL\" ) redis_client = redis.from_url( REDIS_URL ) def get_db (): return psycopg2.connect( POSTGRES_URL ) def publish_event (queue_name, message): connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = queue_name, durable = True ) channel.basic_publish( exchange = '' , routing_key = queue_name, body = json.dumps(message), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() @app.post ( \"/analyze\" ) async def analyze_pdf ( file: UploadFile = File( ... ), analysis_type: str = \"summary\" , model: str = \"llama-3.1-70b\" ): job_id = str (uuid4()) # Store job in database conn = get_db() cur = conn.cursor() cur.execute( \"INSERT INTO jobs (id, filename, status, analysis_type, model) VALUES ( %s , %s , 'pending', %s , %s )\" , (job_id, file .filename, analysis_type, model) ) conn.commit() conn.close() # Save file temporarily and publish to queue file_path = f \"/tmp/ { job_id } .pdf\" with open (file_path, \"wb\" ) as f: f.write( await file .read()) publish_event( \"pdf.uploaded\" , { \"job_id\" : job_id, \"file_path\" : file_path, \"filename\" : file .filename, \"analysis_type\" : analysis_type, \"model\" : model }) return { \"job_id\" : job_id, \"status\" : \"processing\" } @app.get ( \"/status/ {job_id} \" ) def get_status (job_id: str ): # Check cache first cached = redis_client.get( f \"job: { job_id } \" ) if cached: return json.loads(cached) # Query database conn = get_db() cur = conn.cursor() cur.execute( \"SELECT status, filename FROM jobs WHERE id = %s \" , (job_id,)) result = cur.fetchone() conn.close() if not result: raise HTTPException( status_code = 404 , detail = \"Job not found\" ) status_data = { \"job_id\" : job_id, \"status\" : result[ 0 ], \"filename\" : result[ 1 ]} redis_client.setex( f \"job: { job_id } \" , 60 , json.dumps(status_data)) return status_data @app.get ( \"/results/ {job_id} \" ) def get_results (job_id: str ): conn = get_db() cur = conn.cursor() cur.execute( \"SELECT result_data FROM results WHERE job_id = %s \" , (job_id,) ) result = cur.fetchone() conn.close() if not result: raise HTTPException( status_code = 404 , detail = \"Results not found\" ) return result[ 0 ] @app.get ( \"/metrics\" ) def get_metrics (): conn = get_db() cur = conn.cursor() cur.execute( \"SELECT status, COUNT(*) FROM jobs GROUP BY status\" ) metrics = dict (cur.fetchall()) conn.close() return metrics if __name__ == \"__main__\" : import uvicorn uvicorn.run(app, host = \"0.0.0.0\" , port = 8000 ) The Dockerfile ( services/api-gateway/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"main.py\"] Service 2: PDF Ingestion This service validates PDFs, extracts metadata, and stores files in MinIO. 
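Before handing a file to a full parser, a cheap byte-level check can reject obvious non-PDFs early. The sketch below is an assumption about what such a pre-check might look like, not the validation this service performs (the worker relies on PyPDF2 for that); looks_like_pdf is a hypothetical helper.

```python
def looks_like_pdf(data):
    # Cheap structural check on raw bytes; it does not prove the file parses.
    if not data.startswith(b'%PDF-'):
        return False
    # A complete PDF carries an end-of-file marker near the end of the file.
    return b'%%EOF' in data[-1024:]


print(looks_like_pdf(b'%PDF-1.7 fake body %%EOF'))
print(looks_like_pdf(b'plain text upload'))
```

A check like this only filters out junk uploads; a truncated or malformed PDF that passes it would still need to be caught when the parser opens it.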
Create services/pdf-ingestion/requirements.txt : pika PyPDF2 minio python-dotenv The code ( services/pdf-ingestion/worker.py ): import pika import json import os from minio import Minio from PyPDF2 import PdfReader RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) MINIO_URL = os.environ.get( \"MINIO_URL\" ) MINIO_ACCESS = os.environ.get( \"MINIO_ACCESS_KEY\" ) MINIO_SECRET = os.environ.get( \"MINIO_SECRET_KEY\" ) minio_client = Minio( MINIO_URL , access_key = MINIO_ACCESS , secret_key = MINIO_SECRET , secure = False ) # Ensure bucket exists if not minio_client.bucket_exists( \"pdfs\" ): minio_client.make_bucket( \"pdfs\" ) def process_upload (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] file_path = data[ 'file_path' ] try : # Validate PDF reader = PdfReader(file_path) page_count = len (reader.pages) # Store in MinIO minio_client.fput_object( \"pdfs\" , f \" { job_id } .pdf\" , file_path) # Publish to next queue connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.extraction\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"text.extraction\" , body = json.dumps({ ** data, \"page_count\" : page_count, \"minio_path\" : f \"pdfs/ { job_id } .pdf\" }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() os.remove(file_path) ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Processed: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"pdf.uploaded\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"pdf.uploaded\" , on_message_callback = process_upload) print ( \"PDF Ingestion Service Started...\" ) 
channel.start_consuming() The Dockerfile ( services/pdf-ingestion/Dockerfile ): FROM python:3.9-slim 
WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 3: Text Extractor This service extracts text from PDFs with PyPDF2, falling back to pdfplumber when PyPDF2 returns little or no text. The OCR dependencies (pytesseract, pdf2image) are installed, but this simplified worker does not call them. Create services/text-extractor/requirements.txt : pika PyPDF2 pdfplumber pytesseract pdf2image python-dotenv The code ( services/text-extractor/worker.py ): import pika import json import os from PyPDF2 import PdfReader import pdfplumber RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) def extract_text (file_path): text = \"\" # Try PyPDF2 first; extract_text() can return None for image-only pages try : reader = PdfReader(file_path) for page in reader.pages: text += page.extract_text() or \"\" except Exception : pass # Fallback to pdfplumber if PyPDF2 fails or finds almost no text if len (text.strip()) < 100 : text = \"\" with pdfplumber.open(file_path) as pdf: for page in pdf.pages: text += page.extract_text() or \"\" return text.strip() def process_extraction (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : # Download from MinIO (simplified - assume local for demo) file_path = f \"/tmp/ { job_id } .pdf\" text = extract_text(file_path) # Publish to analysis queue connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.ready\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"text.ready\" , body = json.dumps({ ** data, \"extracted_text\" : text }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Extracted text from: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.extraction\" , durable = True ) channel.basic_qos( 
prefetch_count = 1 ) channel.basic_consume( queue = \"text.extraction\" , on_message_callback = process_extraction) print ( \"Text Extractor Service Started...\" ) channel.start_consuming() The Dockerfile ( 
services/text-extractor/Dockerfile ): FROM python:3.9-slim RUN apt-get update && apt-get install -y tesseract-ocr && rm -rf /var/lib/apt/lists/* WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 4: AI Analyzer (Groq) This is where Groq delivers high-speed inference for document analysis. Create services/ai-analyzer/requirements.txt : pika groq python-dotenv The code ( services/ai-analyzer/worker.py ): import pika import json import os from groq import Groq RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) GROQ_API_KEY = os.environ.get( \"GROQ_API_KEY\" ) client = Groq( api_key = GROQ_API_KEY ) def analyze_text (text, analysis_type, model): prompts = { \"summary\" : \"Provide a concise summary of this document in 3-5 bullet points.\" , \"classification\" : \"Classify this document by type and main topics.\" , \"qa_generation\" : \"Generate 5 question-answer pairs from this document.\" , \"full\" : \"Provide a comprehensive analysis including summary, key entities, and main themes.\" } prompt = prompts.get(analysis_type, prompts[ \"summary\" ]) response = client.chat.completions.create( model = model, messages = [ { \"role\" : \"system\" , \"content\" : \"You are a helpful document analysis assistant.\" }, { \"role\" : \"user\" , \"content\" : f \" { prompt }\\n\\n Document: \\n{ text[: 8000 ] } \" } ], temperature = 0.3 ) return response.choices[ 0 ].message.content def process_analysis (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : result = analyze_text( data[ 'extracted_text' ], data[ 'analysis_type' ], data[ 'model' ] ) # Publish results connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"analysis.done\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"analysis.done\" , body = json.dumps({ \"job_id\" : job_id, \"result\" : result }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Analyzed: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.ready\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"text.ready\" , on_message_callback = process_analysis) print ( \"AI Analyzer Service Started...\" ) channel.start_consuming() The Dockerfile ( services/ai-analyzer/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 5: Results Handler The final service persists results to PostgreSQL and invalidates the cached job status in Redis. 
Create services/results-handler/requirements.txt : pika psycopg2-binary redis python-dotenv The code ( services/results-handler/worker.py ): import pika import json import os import psycopg2 import redis RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) POSTGRES_URL = os.environ.get( \"POSTGRES_URL\" ) REDIS_URL = os.environ.get( \"REDIS_URL\" ) redis_client = redis.from_url( REDIS_URL ) def get_db (): return psycopg2.connect( POSTGRES_URL ) def store_results (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : conn = get_db() cur = conn.cursor() # Store result cur.execute( \"INSERT INTO results (job_id, result_data) VALUES ( %s , %s )\" , (job_id, json.dumps({ \"analysis\" : data[ 'result' ]})) ) # Update job status cur.execute( \"UPDATE jobs SET status = 'completed', updated_at = CURRENT_TIMESTAMP WHERE id = %s \" , (job_id,) ) conn.commit() conn.close() # Invalidate cache redis_client.delete( f \"job: { job_id } \" ) ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Stored results for: { job_id } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"analysis.done\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"analysis.done\" , on_message_callback = store_results) print ( \"Results Handler Service Started...\" ) channel.start_consuming() The Dockerfile ( services/results-handler/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Part 3: The Streamlit Frontend The UI provides an intuitive interface for uploads, real-time monitoring, and result visualization. 
Create services/streamlit-ui/requirements.txt : streamlit requests python-dotenv The code ( services/streamlit-ui/app.py ): import streamlit as st import requests import time import json API_URL = \"http://api-gateway:8000\" st.set_page_config( page_title = \"PDF AI Analysis\" , layout = \"wide\" ) st.title( \"PDF AI Analysis Pipeline\" ) tab1, tab2, tab3 = st.tabs([ \"Upload & Analyze\" , \"Dashboard\" , \"History\" ]) # TAB 1: Upload with tab1: uploaded_file = st.file_uploader( \"Upload PDF\" , type = [ \"pdf\" ]) col1, col2 = st.columns( 2 ) with col1: analysis_type = st.selectbox( \"Analysis Type\" , [ \"summary\" , \"classification\" , \"qa_generation\" , \"full\" ] ) with col2: model = st.selectbox( \"Model\" , [ \"llama-3.1-70b\" , \"mixtral-8x7b\" ]) if st.button( \"Start Analysis\" ) and uploaded_file: with st.spinner( \"Uploading...\" ): files = { \"file\" : uploaded_file} data = { \"analysis_type\" : analysis_type, \"model\" : model} response = requests.post( f \" {API_URL} /analyze\" , files = files, data = data) job_id = response.json()[ \"job_id\" ] st.success( f \"Job started: { job_id } \" ) # Poll for status progress_bar = st.progress( 0 ) status_text = st.empty() while True : status_response = requests.get( f \" {API_URL} /status/ { job_id } \" ) status = status_response.json()[ \"status\" ] status_text.text( f \"Status: { status } \" ) if status == \"completed\" : progress_bar.progress( 100 ) results = requests.get( f \" {API_URL} /results/ { job_id } \" ).json() st.json(results) break elif status == \"failed\" : st.error( \"Analysis failed\" ) break progress_bar.progress( 50 ) time.sleep( 2 ) # TAB 2: Dashboard with tab2: metrics = requests.get( f \" {API_URL} /metrics\" ).json() col1, col2, col3 = st.columns( 3 ) col1.metric( \"Total Jobs\" , sum (metrics.values())) col2.metric( \"Completed\" , metrics.get( \"completed\" , 0 )) col3.metric( \"Failed\" , metrics.get( \"failed\" , 0 )) # TAB 3: History with tab3: st.write( \"Coming soon: Job 
history and search\" ) The Dockerfile ( services/streamlit-ui/Dockerfile ): FROM python:3.9-slim 
WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 8501 CMD [\"streamlit\", \"run\", \"app.py\", \"--server.port=8501\", \"--server.address=0.0.0.0\"] Final Integration: Launch Day Running the Stack Build and run: docker-compose up -d --build Access the UI: open http://localhost:8501 Upload a PDF: select analysis type and model, then watch the job status update while it is processed Monitor queues: visit RabbitMQ Management at http://localhost:15672 You have just built an end-to-end, event-driven PDF analysis pipeline. Each worker scales horizontally, a message stays in its queue until a worker acknowledges or rejects it, and Groq's inference speed keeps the analysis step short. Scalability: add replicas to any service independently, based on where the bottleneck is. Speed: Groq's API is built for low-latency inference, which shortens the analysis step for each document. User experience: Streamlit shows job status while processing happens asynchronously in the background. This architecture demonstrates the power of event-driven microservices combined with a hosted inference API. Each service owns a single responsibility, communicates through well-defined message contracts, and can be developed and deployed independently by different teams. Happy coding!"
  },
  {
    "slug": "2026-01-17-building-an-ai-driven-youtube-index",
    "title": "Building a Private, AI-Driven YouTube Knowledge Base",
    "description": "Turn your YouTube subscriptions into a searchable, private RAG engine — autonomous ingestion with yt-dlp, transcription with faster-whisper, embeddings via Ollama and pgvector, and a Streamlit/LangChain chat UI.",
    "tags": [
      "ai",
      "youtube",
      "streamlit",
      "langchain",
      "rag",
      "ollama"
    ],
    "excerpt": "For IT professionals and developers, YouTube has evolved from an entertainment platform into a primary source of continuous education. We rely on it for everything from architectural patterns and cloud infrastructure tutorials to debugging sessions a",
    "content": "For IT professionals and developers, YouTube has evolved from an entertainment platform into a primary source of continuous education. We rely on it for everything from architectural patterns and cloud infrastructure tutorials to debugging sessions and conference talks. However, video is inherently opaque data. Unlike documentation or code repositories, you cannot \"Ctrl+F\" your way through thousands of hours of video history to find that one specific explanation of a concept you watched six months ago. We face a significant gap between content consumption and knowledge retention. The Problem: The Unsearchable Archive As we accumulate subscriptions, we build a massive library of potential knowledge that remains largely inaccessible. The challenges are structural: The \"Black Box\" of Video: Valuable technical insights are often buried deep within long-form content, invisible to standard metadata searches. Fragmentation: Knowledge is siloed across hundreds of channels with no unified way to cross-reference topics (e.g., comparing how three different channels handle Kubernetes networking). Ephemeral Recall: We watch a solution once, but without a text-based index, retrieving that solution during a future incident is nearly impossible. The Solution: A Private RAG Engine In this guide, we are going to build a solution to shift from passive consumption to active conversation. We will build a Retrieval-Augmented Generation (RAG) system that treats your YouTube subscriptions as a private dataset. By leveraging LangChain and Ollama locally, we can create a system that lets you chat with your video history. You can ask, \"How does NetworkChuck explain VLANs?\" and the system will not only find the video but synthesize an answer based on the transcript. Core Architecture To turn this concept into reality, we will adopt a microservices approach using Docker. 
At a high level, the pipeline involves five stages: Ingestion: A service autonomously monitors your subscriptions for new content. Transcription: Using Whisper, the system converts unstructured audio into timestamped text. Indexing: The system chunks transcripts and processes them through an embedding model ( nomic-embed-text ), storing the vectors in PostgreSQL. Retrieval: Your questions are converted into vectors to find relevant transcript segments. Synthesis: Llama 3 reads the retrieved context and generates a precise answer, citing specific video timestamps. Here is the architecture we will build. [Architecture diagram: the user's browser reaches the Streamlit UI (LangChain client) on host port 8501, and every service runs on the Docker network yt-net. Numbered flow: (1) user visits the UI; (2) search vector; (3) download audio from YouTube (yt-dlp); (4) push job (AMQP) from the Ingestion Service (\"The Watcher\", yt-dlp/RSS) to RabbitMQ (management UI on port 15672); (5) generate query embedding (HTTP); (6) pull job (AMQP) into the Processing Worker (\"The Brain\": faster-whisper, ffmpeg, yt-dlp); (7) generate embeddings (HTTP) with Ollama (nomic-embed-text, llama3); (8) store data (SQL) in PostgreSQL (pgvector, videos/subs); (9) chat with data.] 
Part 1: The Infrastructure We will start by defining the \"plumbing\" of our system using Docker Compose and designing our PostgreSQL database schema to handle vector embeddings. Folder Structure Treat this project as a monorepo. Open your terminal and create the following structure: mkdir yt-rag-engine cd yt-rag-engine mkdir database touch docker-compose.yml .env database/init.sql The Docker Compose File We need to orchestrate three core services: PostgreSQL (with pgvector): to store our data and embeddings. RabbitMQ: to manage our background processing queues. Ollama: to run our local LLMs (Llama 3 and Nomic Embed). Open docker-compose.yml and add the following configuration: version : '3.8' services : # 1. The Database (Postgres + pgvector) postgres : image : pgvector/pgvector:pg16 container_name : yt_db environment : POSTGRES_USER : ${DB_USER} POSTGRES_PASSWORD : ${DB_PASS} POSTGRES_DB : ${DB_NAME} ports : - \"5432:5432\" volumes : - postgres_data:/var/lib/postgresql/data - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql networks : - yt-net healthcheck : test : [ \"CMD-SHELL\" , \"pg_isready -U ${DB_USER} -d ${DB_NAME}\" ] interval : 10s timeout : 5s retries : 5 # 2. The Message Broker (RabbitMQ) rabbitmq : image : rabbitmq:3-management container_name : yt_queue ports : - \"5672:5672\" # AMQP protocol - \"15672:15672\" # Management UI environment : RABBITMQ_DEFAULT_USER : ${RABBIT_USER} RABBITMQ_DEFAULT_PASS : ${RABBIT_PASS} networks : - yt-net healthcheck : test : [ \"CMD\" , \"rabbitmq-diagnostics\" , \"-q\" , \"ping\" ] interval : 10s timeout : 5s retries : 5 # 3. 
The AI Server (Ollama) ollama : image : ollama/ollama:latest container_name : yt_ai ports : - \"11434:11434\" volumes : - ollama_models:/root/.ollama networks : - yt-net # Uncomment below to enable GPU support (Nvidia) # deploy: # resources: # reservations: # devices: # - driver: nvidia # count: 1 # capabilities: [gpu] volumes : postgres_data : ollama_models : networks : yt-net : driver : bridge Environment Variables Create a .env file to keep secrets safe: # Database Credentials DB_USER = admin DB_PASS = secretpassword DB_NAME = yt_knowledge_base # RabbitMQ Credentials RABBIT_USER = guest RABBIT_PASS = guest Designing the Schema (pgvector) We need to tell PostgreSQL how to structure our data. The most critical part is enabling the vector extension and defining the embedding column. We are using nomic-embed-text via Ollama, which outputs vectors with 768 dimensions. Open database/init.sql and add this SQL script: -- 1. Enable the pgvector extension CREATE EXTENSION IF NOT EXISTS vector; -- 2. Channels Table: Who are we watching? CREATE TABLE IF NOT EXISTS channels ( id TEXT PRIMARY KEY, -- YouTube Channel ID (e.g., UC123...) name TEXT NOT NULL, url TEXT NOT NULL, last_checked_at TIMESTAMP DEFAULT '1970-01-01' ); -- 3. Videos Table: Metadata for individual videos CREATE TABLE IF NOT EXISTS videos ( id TEXT PRIMARY KEY, -- YouTube Video ID (e.g., dQw4w9WgXcQ) channel_id TEXT REFERENCES channels(id), title TEXT NOT NULL, url TEXT NOT NULL, published_at TIMESTAMP, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, status TEXT DEFAULT 'pending' -- pending, processing, completed, error ); -- 4. 
Transcripts Table: The searchable content CREATE TABLE IF NOT EXISTS transcript_chunks ( id SERIAL PRIMARY KEY, video_id TEXT REFERENCES videos(id) ON DELETE CASCADE, -- The actual text content (for RAG context) chunk_text TEXT NOT NULL, -- Timestamps for deep-linking start_time DOUBLE PRECISION, end_time DOUBLE PRECISION, -- The AI \"Brain\" Part -- 768 dimensions matches nomic-embed-text embedding vector(768) ); -- 5. Create a search index for speed (HNSW algorithm) CREATE INDEX ON transcript_chunks USING hnsw (embedding vector_cosine_ops); Booting Up and Priming Models Before writing code, let's bring up the infrastructure and download the AI models. Start Docker: docker-compose up -d Pull models. Ollama starts empty. Execute these commands to pull the models into the persistent volume: # Pull the Chat Model (for RAG synthesis) docker exec -it yt_ai ollama pull llama3 # Pull the Embedding Model (for Vectorizing) docker exec -it yt_ai ollama pull nomic-embed-text Part 2: The Backend Engine With the infrastructure running, we will now build the two Python services that power the system: the Ingestion Service (Discovery) and the Processing Worker (Analysis). Service 1: The Ingestion Service This service checks RSS feeds and creates \"Job Tickets\" in RabbitMQ. 
Create a folder ingestion_service with a requirements.txt: pika psycopg2-binary feedparser python-dotenv The code (ingestion_service/main.py): import time import feedparser import pika import json import psycopg2 import os from datetime import datetime # Connect to Infrastructure DB_HOST = \"postgres\" RABBIT_HOST = \"rabbitmq\" QUEUE_NAME = \"transcription_queue\" def get_db_connection(): return psycopg2.connect(host=DB_HOST, database=os.environ.get(\"DB_NAME\"), user=os.environ.get(\"DB_USER\"), password=os.environ.get(\"DB_PASS\")) def publish_to_queue(video_data): connection = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST)) channel = connection.channel() channel.queue_declare(queue=QUEUE_NAME, durable=True) channel.basic_publish(exchange='', routing_key=QUEUE_NAME, body=json.dumps(video_data), properties=pika.BasicProperties(delivery_mode=2) # Make message persistent ) connection.close() def check_feeds(): conn = get_db_connection() cur = conn.cursor() # 1. Get all monitored channels cur.execute(\"SELECT id, url FROM channels\") channels = cur.fetchall() for channel_id, channel_url in channels: # YouTube RSS URL format rss_url = f\"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}\" feed = feedparser.parse(rss_url) for entry in feed.entries: video_id = entry.yt_videoid # 2. Check if we already have this video cur.execute(\"SELECT 1 FROM videos WHERE id = %s\", (video_id,)) if cur.fetchone() is None: print(f\"Found new video: {entry.title}\") # 3. Add to DB as 'pending', storing the feed's publish time rather than the time of discovery published_at = datetime(*entry.published_parsed[:6]) cur.execute(\"INSERT INTO videos (id, channel_id, title, url, published_at, status) VALUES (%s, %s, %s, %s, %s, 'pending')\", (video_id, channel_id, entry.title, entry.link, published_at)) conn.commit() # 4. Push to RabbitMQ publish_to_queue({\"video_id\": video_id, \"url\": entry.link, \"title\": entry.title}) conn.close() if __name__ == \"__main__\": print(\"Ingestion Service Started...\") while True: try: check_feeds() except Exception as e: print(f\"Error: {e}\") time.sleep(3600) # Sleep for 1 hour The Dockerfile (ingestion_service/Dockerfile): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"main.py\"] Service 2: The Processing Worker This worker downloads each video's audio, transcribes it, and embeds the transcript. 
Create a folder processing_worker with a requirements.txt: pika psycopg2-binary yt-dlp faster-whisper requests python-dotenv The code (processing_worker/worker.py): import pika import json import os import psycopg2 import requests import yt_dlp from faster_whisper import WhisperModel # Config OLLAMA_API = \"http://ollama:11434/api/embeddings\" MODEL_NAME = \"nomic-embed-text\" TEMP_DIR = \"/app/temp\" # Initialize Whisper (pinned to CPU here; change device to run it on a GPU) model = WhisperModel(\"tiny\", device=\"cpu\", compute_type=\"int8\") def download_audio(video_url, video_id): \"\"\"Downloads audio using yt-dlp to a temp file\"\"\" output_path = f\"{TEMP_DIR}/{video_id}\" ydl_opts = {'format': 'bestaudio/best', 'outtmpl': output_path, 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'}], 'quiet': True} with yt_dlp.YoutubeDL(ydl_opts) as ydl: ydl.download([video_url]) return f\"{output_path}.mp3\" def get_embedding(text): \"\"\"Calls Ollama to get vector embedding\"\"\" response = requests.post(OLLAMA_API, json={\"model\": MODEL_NAME, \"prompt\": text}) response.raise_for_status() # Fail loudly if Ollama returns an error return response.json()['embedding'] def process_video(ch, method, properties, body): data = json.loads(body) video_id = data['video_id'] print(f\"Processing: {data['title']}\") try: # 1. 
Download Audio audio_path = download_audio(data['url'], video_id) # 2. Transcribe segments, _ = model.transcribe(audio_path) conn = psycopg2.connect(host=\"postgres\", database=os.environ.get(\"DB_NAME\"), user=os.environ.get(\"DB_USER\"), password=os.environ.get(\"DB_PASS\")) cur = conn.cursor() # 3. Chunk & Embed chunk_buffer = \"\" start_time = 0.0 end_time = 0.0 for segment in segments: chunk_buffer += segment.text + \" \" end_time = segment.end # Create a chunk roughly every 500 characters if len(chunk_buffer) > 500: vector = get_embedding(chunk_buffer) cur.execute(\"\"\"INSERT INTO transcript_chunks (video_id, chunk_text, start_time, end_time, embedding) VALUES (%s, %s, %s, %s, %s)\"\"\", (video_id, chunk_buffer, start_time, end_time, vector)) chunk_buffer = \"\" start_time = end_time # After the loop: flush the final partial chunk so the end of the transcript is not dropped if chunk_buffer.strip(): vector = get_embedding(chunk_buffer) cur.execute(\"\"\"INSERT INTO transcript_chunks (video_id, chunk_text, start_time, end_time, embedding) VALUES (%s, %s, %s, %s, %s)\"\"\", (video_id, chunk_buffer, start_time, end_time, vector)) # 4. Mark Complete cur.execute(\"UPDATE videos SET status = 'completed' WHERE id = %s\", (video_id,)) conn.commit() conn.close() os.remove(audio_path) print(f\"Done: {data['title']}\") ch.basic_ack(delivery_tag=method.delivery_tag) except Exception as e: print(f\"Error processing {video_id}: {e}\") ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False) # Start Consumer connection = pika.BlockingConnection(pika.ConnectionParameters(\"rabbitmq\")) channel = connection.channel() channel.queue_declare(queue=\"transcription_queue\", durable=True) channel.basic_qos(prefetch_count=1) channel.basic_consume(queue=\"transcription_queue\", on_message_callback=process_video) print(\"Processing Worker Started...\") channel.start_consuming() The Dockerfile (processing_worker/Dockerfile). This image also installs ffmpeg, which yt-dlp needs to extract the audio track: FROM python:3.9-slim # Install ffmpeg RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/* WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Create temp directory RUN mkdir -p /app/temp COPY . . CMD [\"python\", \"worker.py\"] Part 3: The Control Center (Streamlit) Finally, we need a UI to manage subscriptions and, most importantly, chat with the data. 
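Before the UI code, it may help to see what chatting with the data means at the retrieval step. The index in the schema uses vector_cosine_ops, and pgvector's cosine distance is 1 minus the cosine similarity, so the closest chunks sort first. The sketch below reproduces that ranking in plain Python on toy 2-dimensional vectors; cosine_distance and top_k are hypothetical helper names, and the real app leaves this work to PostgreSQL.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity: 0.0 for identical direction, 1.0 for orthogonal vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    # chunks is a list of (text, vector) pairs; return the k texts nearest the query.
    ranked = sorted(chunks, key=lambda chunk: cosine_distance(query_vec, chunk[1]))
    return [text for text, _ in ranked[:k]]
```

In the app, the same ordering comes from an ORDER BY on the embedding column with a LIMIT of 5, and the HNSW index lets the database avoid comparing the query against every row.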
Create a folder streamlit_app with a requirements.txt: streamlit langchain-community langchain-core langchain-ollama psycopg2-binary yt-dlp python-dotenv The code (streamlit_app/app.py): import streamlit as st import psycopg2 import os import yt_dlp from langchain_ollama import ChatOllama, OllamaEmbeddings from langchain_core.prompts import ChatPromptTemplate from langchain_core.output_parsers import StrOutputParser # Config DB_HOST = \"postgres\" DB_NAME = os.environ.get(\"DB_NAME\") DB_USER = os.environ.get(\"DB_USER\") DB_PASS = os.environ.get(\"DB_PASS\") OLLAMA_URL = \"http://ollama:11434\" st.set_page_config(page_title=\"YouTube Knowledge Base\", layout=\"wide\") st.title(\"AI YouTube Knowledge Base\") # --- DATABASE FUNCTIONS --- def get_db_connection(): return psycopg2.connect(host=DB_HOST, database=DB_NAME, user=DB_USER, password=DB_PASS) def add_channel(url): # yt-dlp spells this option 'playlistend'; one entry is enough to resolve the channel ydl_opts = {'quiet': True, 'extract_flat': True, 'playlistend': 1} with yt_dlp.YoutubeDL(ydl_opts) as ydl: try: info = ydl.extract_info(url, download=False) channel_id = info.get('channel_id') name = info.get('uploader') or info.get('channel') conn = get_db_connection() cur = conn.cursor() cur.execute(\"INSERT INTO channels (id, name, url) VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING\", (channel_id, name, url)) conn.commit() conn.close() return f\"Success: Added {name}\" except Exception as e: return f\"Error: {str(e)}\" def get_context(query_text): \"\"\"Semantic Search: Vector -> SQL Cosine Similarity\"\"\" embeddings = OllamaEmbeddings(base_url=OLLAMA_URL, model=\"nomic-embed-text\") query_vector = embeddings.embed_query(query_text) conn = get_db_connection() cur = conn.cursor() cur.execute(\"\"\" SELECT t.chunk_text, v.title, v.url, t.start_time FROM transcript_chunks t JOIN videos v ON t.video_id = v.id ORDER BY t.embedding <=> %s::vector LIMIT 5 \"\"\", (str(query_vector),)) results = cur.fetchall() conn.close() return results # --- UI LAYOUT --- tab1, tab2 = st.tabs([\"Chat with Knowledge\", \"Manage Subscriptions\"]) # TAB 1: RAG CHAT with tab1: user_query = st.text_input(\"Ask a question about your videos:\") if st.button(\"Ask AI\") and user_query: with st.spinner(\"Thinking...\"): results = get_context(user_query) if not results: st.warning(\"No relevant info found in database.\") else: context_text = \"\" for i, (text, title, url, start) in enumerate(results): context_text += f\"\\n[Source {i + 1}]: {text} (From '{title}')\\n\" # RAG Synthesis llm = ChatOllama(base_url=OLLAMA_URL, model=\"llama3\") prompt = ChatPromptTemplate.from_template(\"\"\" You are a helpful AI assistant. Answer the user's question based ONLY on the following context. If the answer is not in the context, say \"I don't know\". Context: {context} Question: {question} \"\"\") chain = prompt | llm | StrOutputParser() response = chain.invoke({\"context\": context_text, \"question\": user_query}) st.markdown(\"### AI Answer\") st.write(response) st.markdown(\"---\") st.subheader(\"Reference Clips\") for text, title, url, start in results: video_link = f\"{url}&t={int(start)}s\" st.markdown(f\"**[{title}]({video_link})**\") st.caption(f\"...{text}...\") # TAB 2: MANAGE with tab2: st.header(\"Add New Channel\") new_url = st.text_input(\"Paste Channel URL\") if st.button(\"Subscribe\"): with st.spinner(\"Resolving Channel...\"): msg = add_channel(new_url) st.write(msg) st.header(\"Active Subscriptions\") conn = get_db_connection() cur = conn.cursor() cur.execute(\"SELECT name, url, last_checked_at FROM channels\") rows = cur.fetchall() for row in rows: st.write(f\"**[{row[0]}]({row[1]})** - Last Checked: {row[2]}\") conn.close() The Dockerfile (streamlit_app/Dockerfile): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 8501 CMD [\"streamlit\", \"run\", \"app.py\", \"--server.port=8501\", \"--server.address=0.0.0.0\"] Final Integration: Launch Day We need to update our docker-compose.yml to include our new Python services. 
Add the following to the services: block: ingestion: build: ./ingestion_service container_name: yt_ingestion environment: - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks: - yt-net depends_on: postgres: condition: service_healthy rabbitmq: condition: service_healthy worker: build: ./processing_worker container_name: yt_worker environment: - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks: - yt-net depends_on: postgres: condition: service_healthy rabbitmq: condition: service_healthy ollama: condition: service_started streamlit: build: ./streamlit_app container_name: yt_ui ports: - \"8501:8501\" environment: - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks: - yt-net depends_on: postgres: condition: service_healthy ollama: condition: service_started Running the Stack Build and run: docker-compose up -d --build Access the app: open your browser to http://localhost:8501. Add a channel: go to \"Manage Subscriptions\" and add a URL like https://www.youtube.com/@Fireship. Watch it work: on its next hourly check, the ingestion service will queue the latest videos, and the worker will begin transcribing them (view logs with docker logs -f yt_worker). Chat: once processing is complete, go to the \"Chat with Knowledge\" tab and ask: \"What is the latest JavaScript framework mentioned?\" 
You have just built a private, AI-powered knowledge engine. Privacy: transcripts, embeddings, and your questions stay on your machine. Your viewing habits remain yours. Cost: $0 in API fees. No OpenAI API keys. No SaaS subscriptions. Just local compute. Utility: you have turned a passive stream of entertainment into an active database of answers. This project is a good example of what Agentic AI and Local LLMs make possible. You didn't just write a script; you built a system that sees, listens, and thinks. The database is yours, so build what you need. Happy coding!"
  }
]