šŸ¦Ž Keeping My Pangolin Instance Alive on Oracle Cloud (With a Simple Watchdog)

created with ChatGPT 5.5

TL;DR

Oracle Cloud can detect when your VM has problems — but it won’t automatically fix them.

I built a small watchdog script that:

  • checks if my service is reachable
  • queries OCI health metrics
  • automatically resets the VM if needed

It runs from my homelab and just works. The script is on GitHub (link at the end of this post).

The Problem

I’m running a Pangolin instance on Oracle Cloud Infrastructure (Free Tier).

Overall? Surprisingly solid.

But every now and then:

  • the service stops responding
  • the tunnel is dead
  • HTTP just hangs

From the outside, it looks like the VM is gone.

But when I check OCI:

šŸ‘‰ The instance is still running


“Doesn’t Oracle detect that?”

Actually… yes and no.

OCI provides an infrastructure metric:

instance_status
  • 0 → everything fine
  • 1 → infrastructure problem
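If you want to look at this metric yourself, the OCI CLI can query it from the oci_compute_infrastructure_health namespace. A sketch (the compartment OCID is a placeholder, it needs a configured OCI CLI, and the date invocation assumes GNU date):

```shell
# Summarize the last hour of instance_status for a compartment.
# <COMPARTMENT_OCID> is a placeholder; fill in your own.
oci monitoring metric-data summarize-metrics-data \
  --compartment-id <COMPARTMENT_OCID> \
  --namespace oci_compute_infrastructure_health \
  --query-text 'instance_status[1m].max()' \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
```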

So Oracle does know when something is wrong at the hypervisor level.

But here’s the catch:

Your application can be completely dead while OCI still says everything is fine.

Existing solutions

There is already a project for this:

šŸ‘‰ https://github.com/afreidah/oracle-watchdog

It’s a solid approach, but for my setup it felt a bit heavy.

I didn’t want:

  • additional infrastructure
  • complex deployments
  • something I don’t fully understand

The idea

Instead of relying purely on OCI, I flipped it around:

šŸ‘‰ Check from the outside — like a real user would

Then combine that with OCI data.


The approach

With the help of AI, I built a small Bash script that:

  1. Queries the OCI metric instance_status
  2. Checks my public endpoint: https://pangolin.roksblog.de
  3. Decides what to do

The logic

If OCI reports a problem → reset
Else if HTTP fails multiple times → reset
Else → do nothing

Simple. Transparent. No magic.
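That branch fits in a few lines of Bash. A simplified sketch of the decision (the variable names are mine, not necessarily what the repo uses):

```shell
#!/usr/bin/env bash
# Sketch of the watchdog's decision branch.
#   instance_status : 0 = OCI says fine, 1 = infrastructure problem
#   http_failures   : consecutive failed HTTP checks so far
#   threshold       : failures tolerated before a reset
decide() {
  local instance_status=$1 http_failures=$2 threshold=$3
  if [ "$instance_status" -eq 1 ]; then
    echo reset            # OCI itself reports a problem
  elif [ "$http_failures" -ge "$threshold" ]; then
    echo reset            # the endpoint is dead from the outside
  else
    echo ok               # nothing to do
  fi
}
```

For example, `decide 0 3 3` prints `reset`, while `decide 0 2 3` prints `ok`.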


Why this works better than OCI alone

Because it reflects reality.

OCI sees:

  • VM is up
  • hypervisor is fine

But your users experience:

  • timeouts
  • broken service
  • no response

šŸ‘‰ The HTTP check is the truth.


Making it safe

The important part is avoiding “panic reboots”.

So the script includes:

🧠 Failure threshold

Only reset after multiple failed checks (e.g. 3)
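Persisting that counter across cron runs takes a small state file. A sketch (the file path and helper names are my own choices):

```shell
#!/usr/bin/env bash
# Failure counter that survives between cron runs.
FAIL_FILE="${FAIL_FILE:-/tmp/oci-watchdog.failcount}"
THRESHOLD=3

# Bump the counter; return success (0) once the threshold is reached.
record_failure() {
  local n=0
  [ -f "$FAIL_FILE" ] && n=$(cat "$FAIL_FILE")
  n=$((n + 1))
  echo "$n" > "$FAIL_FILE"
  [ "$n" -ge "$THRESHOLD" ]
}

# Reset the counter after a successful check.
clear_failures() { rm -f "$FAIL_FILE"; }
```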

ā± Cooldown

Prevent reboot loops
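One way to implement the cooldown is a timestamp file: remember when the last reset happened and refuse to fire again too soon. A sketch (the path and the 30-minute window are my assumptions):

```shell
#!/usr/bin/env bash
# Refuse to reset again within COOLDOWN seconds of the last reset.
STATE_FILE="${STATE_FILE:-/tmp/oci-watchdog.last-reset}"
COOLDOWN=1800   # 30 minutes

# Success (0) if we are still inside the cooldown window.
in_cooldown() {
  [ -f "$STATE_FILE" ] || return 1
  local last now
  last=$(cat "$STATE_FILE")
  now=$(date +%s)
  [ $(( now - last )) -lt "$COOLDOWN" ]
}

# Record "a reset just happened".
mark_reset() { date +%s > "$STATE_FILE"; }
```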

🌐 Smart HTTP handling

Only treat real failures as errors:

  • timeout (000)
  • 500, 502, 503, 504

Not:

  • redirects
  • auth errors
  • 404s
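With curl, that distinction is easy: ask for only the status code, then classify it. A sketch of both helpers (the function names are mine):

```shell
#!/usr/bin/env bash
# curl prints the status code, and 000 on timeout / connection failure.
http_status() {
  curl -o /dev/null -s -w '%{http_code}' --max-time 10 "$1"
}

# Only hard failures count: no connection at all, or 5xx server errors.
is_hard_failure() {
  case "$1" in
    000|500|502|503|504) return 0 ;;  # treat as "service down"
    *)                   return 1 ;;  # 2xx/3xx/401/404: the server answered
  esac
}
```

Usage: `is_hard_failure "$(http_status https://pangolin.roksblog.de)"`.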

Bonus: Telegram alerts

Because silent restarts are boring.

The script sends notifications when:

  • a failure is detected
  • a reset is triggered
  • the service recovers
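Sending those notifications is a single call to the Telegram Bot API's sendMessage method. A sketch that falls back to a dry-run log line when the bot token and chat ID (environment variable names are my choice) are not set:

```shell
#!/usr/bin/env bash
# Send a Telegram message, or just log it if no credentials are configured.
notify() {
  local msg="$1"
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    echo "notify (dry-run): $msg"
    return 0
  fi
  curl -s -X POST \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
    -d chat_id="${TELEGRAM_CHAT_ID}" \
    -d text="$msg" > /dev/null
}
```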

Running it from my homelab

This is the part I like most:

šŸ‘‰ The watchdog runs outside OCI

That means:

  • it still works if the VM is completely dead
  • no dependency on the instance itself
  • no chicken-and-egg problem

Cron job:

*/10 * * * * ~/scripts/oci-watchdog/oci-watchdog.sh

OCI integration

Reset is done via CLI:

oci compute instance action \
  --instance-id <INSTANCE_ID> \
  --action RESET

Authentication is handled via:

~/.oci/config
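For reference, a minimal ~/.oci/config looks roughly like this; every value below is a placeholder you get from the OCI console when creating an API key:

```
[DEFAULT]
user=ocid1.user.oc1..<YOUR_USER_OCID>
fingerprint=<YOUR_KEY_FINGERPRINT>
tenancy=ocid1.tenancy.oc1..<YOUR_TENANCY_OCID>
region=<YOUR_REGION>
key_file=~/.oci/oci_api_key.pem
```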

Result

Since running this:

  • no more “silent hangs”
  • automatic recovery
  • clear logs and notifications

And most importantly:

šŸ‘‰ I don’t have to think about it anymore


Lessons learned

This is one of those classic cloud lessons:

The provider gives you building blocks — not complete solutions.

OCI gives you:

  • metrics
  • APIs
  • compute

But:
šŸ‘‰ you still need to build the glue


Final thoughts

Could this be more “enterprise”?

Sure:

  • OCI Functions
  • Monitoring alarms
  • full auto-healing pipelines

But honestly?

For a Free Tier setup:

šŸ‘‰ this simple watchdog is perfect


Repo

I’ve put the script and setup here:

šŸ‘‰ https://github.com/r0k5t4r/oci-watchdog


Closing thought

Sometimes the best solution is not:

“more cloud”

but:

“a small script that actually solves the problem”