Keeping My Pangolin Instance Alive on Oracle Cloud (With a Simple Watchdog)
TL;DR
Oracle Cloud can detect when your VM has problems, but it won't automatically fix them.
I built a small watchdog script that:
- checks if my service is reachable
- queries OCI health metrics
- automatically resets the VM if needed
It runs from my homelab and just works. You can find it in the repo linked at the end of this post.
The Problem
I'm running a Pangolin instance on Oracle Cloud Infrastructure (Free Tier).
Overall? Surprisingly solid.
But every now and then:
- the service stops responding
- the tunnel is dead
- HTTP just hangs
From the outside, it looks like the VM is gone.
But when I check OCI:
The instance is still running.
"Doesn't Oracle detect that?"
Actually... yes and no.
OCI provides an infrastructure metric:
instance_status
- 0 → everything fine
- 1 → infrastructure problem
So Oracle does know when something is wrong at the hypervisor level.
But here's the catch:
Your application can be completely dead while OCI still says everything is fine.
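For reference, here is a minimal sketch of how that metric could be read with the OCI CLI. It is not taken from the actual script. The function name is made up, and the sketch assumes the metric lives in the `oci_compute_infrastructure_health` namespace and is keyed by a `resourceId` dimension holding the instance OCID. Both OCIDs are placeholders you have to supply.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the latest instance_status value for one instance.
# Prints 0 (fine), 1 (infrastructure problem) or "unknown" (no usable data).
oci_instance_status() {
  local compartment_id="$1" instance_id="$2" value
  # --query is the CLI's built-in JMESPath filter; it picks the newest datapoint.
  value="$(oci monitoring metric-data summarize-metrics-data \
    --compartment-id "$compartment_id" \
    --namespace oci_compute_infrastructure_health \
    --query-text "instance_status[5m]{resourceId = \"${instance_id}\"}.max()" \
    --query 'data[0]."aggregated-datapoints"[-1].value' \
    --raw-output 2>/dev/null)" || value=""
  # Normalise "1.0" to "1"; anything else is treated as unknown.
  case "${value%%.*}" in
    0) echo 0 ;;
    1) echo 1 ;;
    *) echo unknown ;;
  esac
}
```

Treating a missing or unparsable value as "unknown" instead of as a problem keeps a failed API call from triggering a reset on its own.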
Existing solutions
There is already a project for this:
https://github.com/afreidah/oracle-watchdog
It's a solid approach, but for my setup it felt a bit heavy.
I didn't want:
- additional infrastructure
- complex deployments
- something I don't fully understand
The idea
Instead of relying purely on OCI, I flipped it around:
Check from the outside, like a real user would.
Then combine that with OCI data.
The approach
With the help of AI, I built a small Bash script that:
- Queries OCI: instance_status
- Checks my public endpoint: https://pangolin.roksblog.de
- Decides what to do
The logic
    If OCI reports a problem → reset
    Else if HTTP fails multiple times → reset
    Else → do nothing
Simple. Transparent. No magic.
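The three-way decision above can be sketched as one small Bash function. This is an illustration only, not the code from the repo: the function name, argument order and the "reset"/"none" output strings are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide_action OCI_STATUS HTTP_FAILURES THRESHOLD
# Prints "reset" or "none".
decide_action() {
  local oci_status="$1" http_failures="$2" threshold="$3"
  if [ "$oci_status" = "1" ]; then
    echo reset   # OCI itself reports an infrastructure problem
  elif [ "$http_failures" -ge "$threshold" ]; then
    echo reset   # the endpoint failed enough consecutive checks
  else
    echo none    # healthy, or not enough evidence yet
  fi
}
```

For example, `decide_action 0 3 3` prints `reset`, while `decide_action 0 2 3` prints `none`.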
Why this works better than OCI alone
Because it reflects reality.
OCI sees:
- VM is up
- hypervisor is fine
But your users experience:
- timeouts
- broken service
- no response
The HTTP check is the truth.
Making it safe
The important part is avoiding "panic reboots".
So the script includes:
Failure threshold
Only reset after multiple failed checks (e.g. 3)
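Because cron starts a fresh process on every run, a failure count has to be stored somewhere between runs. One possible way to do that, assuming a small state file, is sketched below. The file path and function names are invented for illustration and may differ from what the real script does.

```bash
#!/usr/bin/env bash
# Hypothetical state file holding the number of consecutive failed checks.
STATE_FILE="${STATE_FILE:-${TMPDIR:-/tmp}/oci-watchdog.failures}"

# Print the stored failure count (0 if the file is missing or invalid).
read_failures() {
  local n=0
  [ -r "$STATE_FILE" ] && read -r n < "$STATE_FILE"
  case "$n" in
    ''|*[!0-9]*) n=0 ;;
  esac
  echo "$n"
}

# record_result ok|fail: reset the counter on success, increment it on
# failure, then store and print the new count.
record_result() {
  local n
  if [ "$1" = "ok" ]; then
    n=0
  else
    n=$(( $(read_failures) + 1 ))
  fi
  echo "$n" > "$STATE_FILE"
  echo "$n"
}
```

A single successful check sets the counter back to zero, so only consecutive failures count toward the threshold.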
Cooldown
Prevent reboot loops
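A cooldown can be as simple as remembering when the last reset happened and refusing to reset again too soon. The sketch below assumes a timestamp file and a 30-minute window. Both values, and the function names, are illustrative choices and not the script's actual settings.

```bash
#!/usr/bin/env bash
# Hypothetical settings: cooldown length and where the last reset time is kept.
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-1800}"
LAST_RESET_FILE="${LAST_RESET_FILE:-${TMPDIR:-/tmp}/oci-watchdog.last-reset}"

# Succeeds (exit 0) while the last reset is younger than COOLDOWN_SECONDS.
# An optional first argument overrides "now" (epoch seconds).
cooldown_active() {
  local now="${1:-$(date +%s)}" last=0
  [ -r "$LAST_RESET_FILE" ] && read -r last < "$LAST_RESET_FILE"
  case "$last" in
    ''|*[!0-9]*) last=0 ;;
  esac
  [ $(( now - last )) -lt "$COOLDOWN_SECONDS" ]
}

# Store the time of the reset that was just triggered.
mark_reset() {
  echo "${1:-$(date +%s)}" > "$LAST_RESET_FILE"
}
```

The watchdog would call `cooldown_active` before issuing a reset and `mark_reset` right after one, which gives the VM time to boot before it can be reset again.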
Smart HTTP handling
Only treat real failures as errors:
- timeout (000)
- 500, 502, 503, 504
Not:
- redirects
- auth errors
- 404s
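This classification maps naturally onto curl, which reports `000` as the status code when it gets no HTTP response at all. Here is a minimal sketch, assuming curl is used for the check. The function names and the 10-second timeout are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an HTTP status code to "fail" or "ok".
classify_http_code() {
  case "$1" in
    000|500|502|503|504) echo fail ;;  # no response or server-side error
    *) echo ok ;;                      # redirects, auth errors, 404s, 2xx
  esac
}

# Hypothetical helper: probe a URL and classify the result.
check_endpoint() {
  local url="$1" code
  # Without -L curl does not follow redirects, so a 301/302 counts as "ok".
  code="$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$url")" || code="000"
  classify_http_code "$code"
}
```

Usage would be something like `check_endpoint https://pangolin.roksblog.de`, which prints `ok` or `fail`.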
Bonus: Telegram alerts
Because silent restarts are boring.
The script sends notifications when:
- a failure is detected
- a reset is triggered
- the service recovers
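Sending such a notification only takes one call to the Telegram Bot API's `sendMessage` method. The sketch below shows one way to do it. The function name and the `TELEGRAM_BOT_TOKEN` / `TELEGRAM_CHAT_ID` variable names are assumptions, not necessarily what the script uses.

```bash
#!/usr/bin/env bash
# Hypothetical helper: send_telegram "message text"
# Does nothing when the bot token or chat id is not configured.
send_telegram() {
  local text="$1"
  if [ -z "${TELEGRAM_BOT_TOKEN:-}" ] || [ -z "${TELEGRAM_CHAT_ID:-}" ]; then
    return 0
  fi
  # --data-urlencode makes this a form-encoded POST and escapes the text.
  curl --silent --show-error --max-time 10 --output /dev/null \
    --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
    --data-urlencode "text=${text}" \
    "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage"
}
```

Making the function a no-op when the credentials are missing keeps alerts optional: the watchdog still works without a bot.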
Running it from my homelab
This is the part I like most:
The watchdog runs outside OCI.
That means:
- it still works if the VM is completely dead
- no dependency on the instance itself
- no chicken-and-egg problem
Cron job:
    */10 * * * * ~/scripts/oci-watchdog/oci-watchdog.sh
OCI integration
Reset is done via CLI:
    oci compute instance action \
      --instance-id <INSTANCE_ID> \
      --action RESET
Authentication is handled via:
    ~/.oci/config
Result
Since running this:
- no more "silent hangs"
- automatic recovery
- clear logs and notifications
And most importantly:
I don't have to think about it anymore.
Lessons learned
This is one of those classic cloud lessons:
The provider gives you building blocks, not complete solutions.
OCI gives you:
- metrics
- APIs
- compute
But:
you still need to build the glue
Final thoughts
Could this be more "enterprise"?
Sure:
- OCI Functions
- Monitoring alarms
- full auto-healing pipelines
But honestly?
For a Free Tier setup:
this simple watchdog is perfect
Repo
I've put the script and setup here:
https://github.com/r0k5t4r/oci-watchdog
Closing thought
Sometimes the best solution is not:
"more cloud"
but:
"a small script that actually solves the problem"