
Originally published on 12 Grams of Carbon

Agentics: Debugging when everything is on fire

Using AI to debug critical system failures.

Amol Kapoor · November 20, 2025

Have I mentioned that I am really lazy?

A few weeks ago I posted this note:

Got Claude Code to debug a memory leak that was crashing my system within a few minutes of the leak starting. I created the conditions for the leak, ran Claude with dangerously-skip-permissions, and told Claude it had a few minutes to figure out what was going on before the system died.

It actually managed to figure it out by inspecting the jobs that were causing the system crash in real time. Here’s what it found:

So here’s the infinite loop:

  1. Root runs: concurrently npm:test:ui npm:test:server npm:test:plugin
  2. This executes: cd plugin && npm test
  3. Plugin is empty (no package.json)
  4. npm searches up, finds ROOT package.json
  5. Runs ROOT’s test script again: concurrently npm:test:ui npm:test:server npm:test:plugin
  6. This creates ANOTHER cd plugin && npm test
  7. Infinite recursion!

Took maybe 3 runs (and by ‘run’ I mean ‘system crash’ lmao). Each time the system crashed I would reload a nori claude agent with the previous transcript and go ‘system crashed, pick up where you left off’ and I’d restart the leak. Shockingly effective, end to end the whole debug took ~15 minutes.
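The lookup in steps 3 to 5 above can be illustrated with a small Python sketch. This is not npm's actual implementation; `find_nearest_package_json` and `is_safe_to_run_npm_in` are hypothetical helpers that only mimic the "walk up until a package.json is found" behavior described in the note, to show why an empty plugin directory resolves to the root manifest.

```python
from pathlib import Path
from typing import Optional


def find_nearest_package_json(start: Path) -> Optional[Path]:
    """Return the closest package.json at or above `start`, or None.

    Hypothetical helper: imitates the upward search described above,
    not npm's real resolution code.
    """
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / "package.json"
        if candidate.is_file():
            return candidate
    return None


def is_safe_to_run_npm_in(directory: Path) -> bool:
    """True only if `directory` has its own package.json."""
    return (directory / "package.json").is_file()
```

A guard like `is_safe_to_run_npm_in(Path("plugin"))` in front of `cd plugin && npm test` would stop the chain at step 3 instead of letting the root test script launch itself again.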

[Image: the "This is Fine" meme]
My nori coding agent instance

My current day to day operating procedure is:

My laptop is a piece of shit and sometimes will crash, freeze, or otherwise become unresponsive. It seemingly does this for all sorts of creative reasons, or no reason at all. The most common culprits were OOMs caused by:

I suppose this is inevitable when you have the equivalent of five to seven software engineers running your computer at the same time…the machine only has so many gigs of RAM.

Anyway, whenever my computer crashed I would spin up a nori claude instance and ask it to profile the system and figure out what was going on. Most of the time it did a decent job: it would look through the journalctl logs and so on. But at some point it is fundamentally limited, because the “problem” it is trying to debug is transient and is not currently happening. The best time to figure out why your system is crashing is while it is crashing, but this is really hard. First, you have to get really lucky with timing; second, your system has to remain responsive enough that you can still interact with it through your normal GUI.

Fast forward a bit and I realized that if I could reproduce the conditions under which a certain OOM occurred, I could spin up a coding agent while the OOM was happening and get the model to debug it. And that worked! In real time the nori coding agent ran through a suite of system commands, figured out which jobs were causing problems, and then even started to dig into the explicit function calls (Python and Node processes can both be inspected at the function-call level by sideloaded processes).
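As a rough idea of the "which jobs are causing problems" step, here is a minimal Python sketch that ranks processes by resident memory. It assumes a Linux machine with a /proc filesystem (the journalctl mention suggests Linux, but that is my assumption). The function names are made up for illustration, and this is not the set of commands the agent actually ran.

```python
import os
import re
from typing import Dict, List, Optional, Tuple


def parse_status(text: str) -> Dict[str, str]:
    """Parse the 'Key:<tab>value' lines of a /proc/<pid>/status file."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def rss_kib(fields: Dict[str, str]) -> Optional[int]:
    """Resident set size in KiB, or None (kernel threads have no VmRSS)."""
    match = re.match(r"(\d+)\s*kB", fields.get("VmRSS", ""))
    return int(match.group(1)) if match else None


def top_memory_processes(
    limit: int = 5, proc_root: str = "/proc"
) -> List[Tuple[int, str, int]]:
    """Return (pid, name, rss_kib) for the largest processes, biggest first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                fields = parse_status(handle.read())
        except OSError:
            continue  # process exited between listdir and open
        rss = rss_kib(fields)
        if rss is not None:
            rows.append((int(entry), fields.get("Name", "?"), rss))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]
```

Calling `top_memory_processes()` a few times in a row while the leak is running shows which process names keep growing, which is roughly the first thing you want to know before the machine locks up.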

Besides this being extremely cool, I realized that with a few tweaks I could make it a legitimately useful SRE tool. The basic idea: any time certain system vitals cross a threshold, spin up a coding agent and have the agent debug what is going on as aggressively as possible, with all logs being streamed to a third-party server (in addition to being stored on disk). This basic abstraction would solve two huge problems:

TLDR: I built a nifty tool that I’m calling Nori Premortem. It’s a very simple CLI tool. It takes in a config, sets up a watcher process on your system vitals, and if any threshold is ever breached it will spin up a coding agent tasked with figuring out wtf is going on. Nori Premortem is by default hooked into the broader nori observability platform (which we recently branded as nori watchtower), but you can bring your own server if you like. You’ll just have to reverse engineer the API.
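To make the watcher idea concrete, here is a minimal Python sketch of "poll vitals, compare against thresholds, fire a callback on breach". It is not the nori-premortem implementation or its config format. The threshold dictionary, the `memory_percent` key, and every function name here are assumptions made up for illustration, and the memory reading assumes Linux's /proc/meminfo.

```python
import time
from typing import Callable, Dict, List, Optional


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo text into {field: numeric value}."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo: Dict[str, int]) -> float:
    """Percent of RAM that is not available for new allocations."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total


def breached(vitals: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Names of vitals that are at or above their configured threshold."""
    return [name for name, limit in thresholds.items()
            if vitals.get(name, 0.0) >= limit]


def watch(
    thresholds: Dict[str, float],
    read_vitals: Callable[[], Dict[str, float]],
    on_breach: Callable[[List[str], Dict[str, float]], None],
    interval_s: float = 1.0,
    max_polls: Optional[int] = None,
) -> bool:
    """Poll vitals; call on_breach once on the first breach and return True.

    Returns False if max_polls is reached without any breach.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        vitals = read_vitals()
        names = breached(vitals, thresholds)
        if names:
            on_breach(names, vitals)  # e.g. launch a coding agent here
            return True
        polls += 1
        time.sleep(interval_s)
    return False
```

On Linux, `read_vitals` could be something like `lambda: {"memory_percent": memory_used_percent(parse_meminfo(open("/proc/meminfo").read()))}`, with `on_breach` starting the agent and pointing its logs at both disk and a remote server. Keeping the callback injectable means the threshold logic can be tested without actually melting a machine.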

You can find nori-premortem on GitHub (https://github.com/tilework-tech/nori-premortem) and on npm (https://www.npmjs.com/package/nori-premortem).

By the way, a little plug! My company Tilework has shifted into fully focusing on coding agents and their myriad applications, because they are so fun and weird and they can power all sorts of interesting and cool things. We have 40-odd years of developer ecosystem tooling: IDEs and CI/CD and version control and so on. AI coding agents are new. Fundamentally new. And I think people who are building with these tools need a new ecosystem built up around them. That ecosystem is what we’re building. We organically ended up there because we kept building tools that we needed ourselves! Our motto is something like ‘if an AI can’t do it, figure out why until the AI can do it’, and it’s been working really well for us. I’ll be writing about Tilework more, so I added a little ‘Tilework’ tab on my home screen. If you’re interested in seeing what we’re building, drop us a line or check out our GitHub. More to come.