weston 6cbee11482 Phase 1 Complete: Foundation documentation

Added comprehensive homelab documentation:

README.md:
- Hardware inventory and specifications
- Network architecture overview
- Running services catalog
- Quick reference commands
- Project goals and roadmap

docs/network-map.md:
- All device IP assignments
- Port reference guide
- DNS configuration (Pi-hole + Unbound)
- Remote access setup (Tailscale + Cloudflare)
- Troubleshooting commands

docs/service-inventory.md:
- All 32 Docker containers cataloged
- Running services analysis (6 containers)
- Stopped services review (26 containers)
- Resource usage and recommendations
- Container decision matrix
- Cleanup plan to free 40GB
- Security recommendations
- Prioritized action plan

docs/quick-start.md:
- Emergency recovery procedures
- Service restart sequences
- Backup/restore guides with scripts
- Troubleshooting by scenario
- Health check automation
- Post-recovery checklist
- Common problem solutions

This establishes the foundation for all future homelab projects.
Phase 1 documentation complete! 🎉

2025-11-01 00:42:34 +01:00


🚀 Quick Start & Emergency Recovery Guide

Purpose: Get your homelab back online quickly after disaster
Target Time: 30-60 minutes to basic functionality
Last Updated: October 31, 2025

🎯 Quick Access Reference

Essential URLs

| Service | URL | Default Credentials |
|---|---|---|
| Unraid Dashboard | http://192.168.68.51 | root / (your password) |
| Gitea | https://gitea.segelschiff.app | Weston / (your password) |
| Vaultwarden | http://192.168.68.51:4743 | Master password |
| NPM Admin | http://192.168.68.51:7818 | admin@example.com / changeme (first login) |
| Pi-hole | http://192.168.68.61/admin | (your password) |
| PiKVM | https://192.168.68.53 | admin / admin (default) |

SSH Access

# Local network
ssh root@192.168.68.51

# Via Tailscale (from anywhere)
ssh root@100.122.220.126

# Emergency: Use PiKVM for console access
# https://192.168.68.53

🆘 Emergency Recovery Scenarios

Scenario 1: Server Won't Boot 🚨

Symptoms:

No network connectivity to 192.168.68.51
Unraid WebUI unreachable
No response to ping

Recovery Steps:

Physical Check (via PiKVM or in person)

[ ] Server has power (check LED)
[ ] Network cable connected to eth0
[ ] Monitor shows output (via PiKVM)
[ ] USB boot drive is present and detected

Use PiKVM for Remote Console
- Access: https://192.168.68.53
- Login: admin / admin
- View boot process
- Check BIOS/boot messages

Common Boot Issues

USB Boot Drive Failure (Most common!)

Symptoms: "Boot device not found" or similar

Fix:
1. Have backup USB ready
2. Shut down server (via PiKVM power control)
3. Replace USB boot drive
4. Power on
5. Restore configuration from backup

BIOS Settings Changed

Fix:
1. Enter BIOS (DEL/F2 during boot)
2. Load defaults
3. Verify boot order (USB first)
4. Save and exit

Hardware Failure

Check:
1. RAM seated properly
2. All drives detected in BIOS
3. CPU fan spinning
4. No error beeps

Boot from Backup USB

Steps:
1. Power off server
2. Insert backup USB boot drive
3. Power on
4. Verify boot successful
5. Restore configuration:
   - Tools → Flash Backup → Browse → Select backup ZIP
   - Reboot

Prevention:

✅ Keep USB flash backup updated (weekly)
✅ Store backup USB in safe location
✅ Document BIOS settings (screenshots via PiKVM)

Scenario 2: Lost Admin Password

Unraid Root Password Reset:

Via PiKVM Console

1. Access PiKVM: https://192.168.68.53
2. View console in browser
3. Wait for login prompt
4. Press Ctrl+Alt+F2 (via PiKVM keyboard)
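5. At terminal (needs a session that is already logged in as root; passwd cannot be run from the login prompt itself): passwd root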
6. Enter new password twice
7. Press Ctrl+Alt+F1 to return to GUI
8. Update documentation

Via Physical Access

1. Connect monitor and keyboard to server
2. Press Ctrl+Alt+F2
3. Run (from a session already logged in as root): passwd root
4. Set new password
5. Press Ctrl+Alt+F1

Container Passwords:

Check /mnt/user/appdata/<service>/config
Review environment variables in Docker templates
Use Vaultwarden if accessible
Check this documentation repo in Gitea

Scenario 3: Container Won't Start

Quick Diagnosis:

# Check container status
docker ps -a | grep <container_name>

# View recent logs
docker logs --tail 100 <container_name>

# Look for errors
docker inspect <container_name> | grep -i error

Common Fixes:

Port Conflict:

# Find what's using the port
netstat -tulpn | grep <port>

# Example: Port 3000 already in use
netstat -tulpn | grep 3000

# Stop conflicting service
docker stop <conflicting_container>

Volume Permission Issues:

# Check ownership
ls -la /mnt/user/appdata/<container_name>

# Fix permissions (Unraid standard: 99:100)
chown -R 99:100 /mnt/user/appdata/<container_name>

# Example: Fix Vaultwarden
chown -R 99:100 /mnt/user/appdata/vaultwarden

Dependency Missing:

# Example: Guacamole needs MariaDB
docker start mariadb
sleep 10  # Wait for database initialization
docker start ApacheGuacamole

# Verify dependency is running
docker ps | grep mariadb
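
A fixed sleep can be too short on a slow start and wasted time on a fast one. As a rough sketch, a small polling helper can retry a readiness check until it passes or a timeout expires. The helper name wait_until is hypothetical, and the port check in the commented example assumes MariaDB is published on its default port 3306, which this guide does not confirm.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND about once per second until it
# succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout=$1
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (commented out; port 3306 is an assumption, adjust to your setup):
# docker start mariadb
# wait_until 60 nc -z 192.168.68.51 3306 && docker start ApacheGuacamole
```

Any command that exits 0 when the dependency is ready can serve as the check, so the same helper works for other container pairs.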

Resource Exhaustion:

# Check cache usage
df -h /mnt/cache

# If cache full (>90%), clean up
docker system prune -a  # ⚠️ REMOVES ALL STOPPED CONTAINERS, unused networks, and every image not used by a running container!

# Or free space manually
# See service-inventory.md for cleanup recommendations
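
To make the 90% threshold above something a script can test, a sketch along these lines could wrap df. The function names usage_pct and usage_ok are hypothetical, and the parsing assumes the POSIX df -P column layout.

```shell
# usage_pct PATH
# Print the used-space percentage (digits only) of the filesystem holding PATH.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# usage_ok PATH LIMIT
# Return 0 while usage is at or below LIMIT percent, 1 once it is above,
# and 2 if the usage could not be read.
usage_ok() {
    local pct
    pct=$(usage_pct "$1")
    [ -n "$pct" ] || return 2
    [ "$pct" -le "$2" ]
}

# Example (commented out):
# usage_ok /mnt/cache 90 || echo "Cache above 90%, review cleanup options first"
```

Checking before pruning keeps the destructive cleanup as a last resort rather than a reflex.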

Scenario 4: Network Connectivity Issues

Can't Access from LAN:

# SSH into Unraid (via PiKVM if network down)
ssh root@192.168.68.51

# Check if br0 is up
ip addr show br0
# Should show: 192.168.68.51/22

# Verify IP and routes
ip route | grep default
# Should show: default via 192.168.68.1

# Test router connectivity
ping -c 3 192.168.68.1

# Test internet
ping -c 3 8.8.8.8

# Test DNS (Pi-hole)
nslookup google.com 192.168.68.61

Fix Network Issues:

# Restart networking (from console/PiKVM)
/etc/rc.d/rc.inet1 restart

# If that doesn't work, reboot
reboot

Can't Access Containers:

# Check Docker network
docker network inspect bridge

# Verify container IP
docker inspect <container_name> | grep IPAddress

# Test from Unraid host
curl http://172.17.0.5:8080  # Example: open-webui

# Test port mapping
curl http://192.168.68.51:3000  # Should reach open-webui

DNS Not Resolving:

# Test Pi-hole directly
nslookup google.com 192.168.68.61

# If Pi-hole down, check Pi Zero
ping 192.168.68.61

# SSH to Pi-hole
ssh pi@192.168.68.61

# Check Pi-hole status
pihole status

# Restart if needed
pihole restartdns

Scenario 5: Array Won't Start

Symptoms:

Unraid GUI accessible but array shows "Stopped"
Disks show errors or missing

Troubleshooting:

# Check disk health
smartctl -a /dev/sdb  # Parity
smartctl -a /dev/sdc  # Disk 1

# View disk assignments
cat /boot/config/disk.cfg

# Check for filesystem errors (read-only check)
xfs_repair -n /dev/md1p1

Common Causes:

Parity sync in progress (wait for completion)
Disk failed (check SMART, may need replacement)
Unclean shutdown (filesystem check required)
Disk assignment changed

Recovery:

Start Array in Maintenance Mode
- Click "Start" in Unraid GUI
- Select "Maintenance mode" if prompted
- Run filesystem check if prompted
Review Logs
- Tools → System Log
- Look for disk errors
- Check for power events
If Disk Failed
- Follow Unraid disk replacement procedure
- Do NOT format or write to disk unnecessarily
- Seek help in Unraid forums if uncertain

🔧 Critical Service Restart Procedures

Restart Core Services (Proper Order)

1. Infrastructure First:

# Start reverse proxy (for routing)
docker start NginxProxyManager

# Wait for it to be ready
sleep 5
docker ps | grep NginxProxyManager

# Start tunnel (for remote access)
docker start Cloudflared

# Verify both running
docker ps | grep -E "NginxProxyManager|Cloudflared"

2. Security Services:

# Password manager (critical!)
docker start vaultwarden

# Wait for healthy status
sleep 10
docker ps | grep vaultwarden
# Should show "(healthy)"

# If not healthy, check logs
docker logs --tail 50 vaultwarden
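
Instead of sleeping for a fixed 10 seconds and reading the status by eye, the health state can be polled. The sketch below uses docker inspect with the .State.Health.Status template field; the function name wait_healthy is hypothetical, and it only works for images that define a health check (a container without one never reports "healthy" here).

```shell
# wait_healthy CONTAINER [TIMEOUT_SECONDS]
# Hypothetical helper: polls the container's health status every 2 seconds.
# Returns 0 once the status is "healthy", 1 if the timeout expires first.
wait_healthy() {
    local name=$1 timeout=${2:-60} waited=0 status
    while :; do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
        [ "$status" = "healthy" ] && return 0
        [ "$waited" -ge "$timeout" ] && break
        sleep 2
        waited=$((waited + 2))
    done
    echo "$name not healthy after ${timeout}s (last status: ${status:-unknown})" >&2
    return 1
}

# Example (commented out):
# docker start vaultwarden
# wait_healthy vaultwarden 60 || docker logs --tail 50 vaultwarden
```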

3. Development Tools:

# Git server
docker start Gitea

# Wait for initialization
sleep 5

# Remote access gateway
docker start ApacheGuacamole
# Note: Needs MariaDB if configured

4. Monitoring (IMPORTANT!):

# Database first
docker start Influxdb

# Wait for DB to initialize
sleep 15

# Then metrics collector
docker start Telegraf

# Finally visualization
docker start Grafana

# Verify all running
docker ps | grep -E "Influxdb|Telegraf|Grafana"

5. Optional Services:

# LLM backend
docker start ollama
sleep 10

# LLM interface
docker start open-webui

# Check status (should show "(healthy)" once ready)
docker ps | grep open-webui
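
The five tiers above can also be driven from a single ordered list. This is only a sketch: the function name start_in_order and the STARTUP_PLAN variable are hypothetical, while the container names and wait times are copied from the steps above.

```shell
# Start order as "container:seconds to wait afterwards"
# (names and delays taken from tiers 1-5 above).
STARTUP_PLAN="NginxProxyManager:5 Cloudflared:0 vaultwarden:10 Gitea:5 ApacheGuacamole:0 Influxdb:15 Telegraf:0 Grafana:0 ollama:10 open-webui:0"

# start_in_order NAME:DELAY [NAME:DELAY...]
# Starts each container in the given order and stops at the first failure,
# so later tiers are not started on top of a broken earlier one.
start_in_order() {
    local entry name delay
    for entry in "$@"; do
        name=${entry%%:*}
        delay=${entry##*:}
        echo "Starting $name"
        if ! docker start "$name" >/dev/null; then
            echo "Failed to start $name, stopping here" >&2
            return 1
        fi
        sleep "$delay"
    done
}

# Example (commented out; left unquoted on purpose so the list splits into words):
# start_in_order $STARTUP_PLAN
```

Keeping the order in one variable means a new service only needs one edit, at the position where it belongs in the sequence.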

Stop All Services Gracefully

# Stop all running containers
docker stop $(docker ps -q)

# Verify all stopped
docker ps
# Should list no containers (header row only)

# Wait before stopping array
sleep 5

# Stop array (from GUI)
# Main → Array Operation → Stop

📦 Backup & Restore Procedures

USB Flash Backup (Unraid Configuration)

Create Backup:

Navigate to: Main → Flash → Flash Backup
Click "Backup Now"
Download ZIP file (e.g., unraid-flash-backup-20251031.zip)
Store securely OFF-SERVER:
- OneDrive: /z_Unraid/Backups/
- External drive
- Cloud storage

Restore from Backup:

1. Format new USB drive (if needed)
2. Copy backup ZIP to new USB
3. Extract contents to root of USB
   - config/ directory
   - bzimage, bzroot, etc.
4. Safely eject USB
5. Boot from new USB
6. Configuration restored automatically

Frequency:

Weekly minimum
After ANY configuration change
Before major updates

Container Data Backup

Critical Directories:

Priority 1 (CRITICAL):
/mnt/user/appdata/vaultwarden/     🚨 Your passwords!
/mnt/user/appdata/gitea/            🚨 Your code repositories!

Priority 2 (Important):
/mnt/user/appdata/NginxProxyManager/  Proxy configs
/mnt/user/appdata/Grafana/            Dashboards
/mnt/user/appdata/Influxdb/           Metrics history

Priority 3 (Optional):
/mnt/user/appdata/open-webui/         LLM chat history

Quick Backup Script:

#!/bin/bash
# Save as: /mnt/user/scripts/backup-critical.sh

BACKUP_DIR="/mnt/user/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

echo "Stopping containers..."
docker stop vaultwarden Gitea NginxProxyManager

echo "Backing up data..."
tar -czf "$BACKUP_DIR/vaultwarden.tar.gz" /mnt/user/appdata/vaultwarden
tar -czf "$BACKUP_DIR/gitea.tar.gz" /mnt/user/appdata/gitea
tar -czf "$BACKUP_DIR/npm.tar.gz" /mnt/user/appdata/NginxProxyManager

echo "Restarting containers..."
docker start vaultwarden Gitea NginxProxyManager

echo "✅ Backup complete: $BACKUP_DIR"
ls -lh "$BACKUP_DIR"

Make Executable:

chmod +x /mnt/user/scripts/backup-critical.sh

Run Manually:

/mnt/user/scripts/backup-critical.sh

Schedule (User Scripts Plugin):

Frequency: Daily at 2 AM
Retention: Keep last 30 days
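
The backup script above never deletes anything, so the 30-day retention needs its own step. A possible sketch, assuming backups live in timestamped directories like the ones the script creates; the function name prune_old_backups is hypothetical.

```shell
# prune_old_backups BACKUP_ROOT [DAYS]
# Hypothetical helper: deletes timestamped backup directories
# (named like 20251031_020000) directly under BACKUP_ROOT whose
# modification time is more than DAYS days old. Default: 30.
prune_old_backups() {
    local root=$1 days=${2:-30}
    [ -d "$root" ] || return 1
    find "$root" -mindepth 1 -maxdepth 1 -type d \
        -name '[0-9]*_[0-9]*' -mtime +"$days" \
        -exec rm -rf {} +
}

# Example (commented out), e.g. as the last line of backup-critical.sh:
# prune_old_backups /mnt/user/backups 30
```

The name pattern and the depth limit are there so that a wrong path argument cannot wipe unrelated directories.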

Restore from Backup:

# Example: Restore Vaultwarden
docker stop vaultwarden

# Backup current (corrupted) data
mv /mnt/user/appdata/vaultwarden /mnt/user/appdata/vaultwarden.old

# Extract backup
tar -xzf /mnt/user/backups/20251031_120000/vaultwarden.tar.gz -C /

# Restart container
docker start vaultwarden

# Verify working
curl http://192.168.68.51:4743
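
Before moving the live data aside, it is worth confirming that the archive can actually be read. A minimal sketch, with verify_archive as a hypothetical helper name; it only checks that tar can list the contents, not that the data inside is what you expect.

```shell
# verify_archive FILE
# Hypothetical helper: returns 0 if FILE exists, is non-empty, and tar can
# list its contents; returns 1 with a message on stderr otherwise.
verify_archive() {
    local archive=$1
    if [ ! -s "$archive" ]; then
        echo "Missing or empty: $archive" >&2
        return 1
    fi
    if ! tar -tzf "$archive" >/dev/null 2>&1; then
        echo "Unreadable archive: $archive" >&2
        return 1
    fi
}

# Example (commented out):
# verify_archive /mnt/user/backups/20251031_120000/vaultwarden.tar.gz \
#     && docker stop vaultwarden
```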

⚡ Quick Commands Reference

System Status

# System uptime and load
uptime

# Resource usage
free -h
df -h

# Array status
mdcmd status

# Docker container summary
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"

# Temperature (if sensors installed)
sensors

# Disk health quick check
smartctl -H /dev/sdb  # Parity
smartctl -H /dev/sdc  # Disk 1

Docker Quick Commands

# Start all stopped containers (⚠️ includes ones you stopped on purpose)
docker start $(docker ps -aq -f status=exited)

# Stop all running containers
docker stop $(docker ps -q)

# View logs (last 50 lines)
docker logs --tail 50 <container_name>

# Follow logs in real-time
docker logs -f <container_name>

# Restart container
docker restart <container_name>

# Remove container (⚠️ will lose non-volume data!)
docker rm <container_name>

# Clean up unused resources
docker system prune        # ⚠️ Removes ALL stopped containers, unused networks, dangling images
docker system prune -a     # ⚠️ Also removes every image not used by a running container!
docker system prune --volumes  # ⚠️ Removes unused volumes!

Network Diagnostics

# Check all interfaces
ip addr show

# Test key infrastructure
ping -c 3 192.168.68.1   # Router
ping -c 3 192.168.68.51  # Unraid
ping -c 3 192.168.68.61  # Pi-hole
ping -c 3 8.8.8.8        # Internet

# DNS resolution test
nslookup google.com
nslookup google.com 192.168.68.61  # Test Pi-hole specifically

# Check listening ports
netstat -tulpn | grep LISTEN

# Test specific port
nc -zv 192.168.68.51 3002  # Example: Gitea
curl -I http://192.168.68.51:3002  # HTTP test

Quick Health Check Script

#!/bin/bash
# Save as: /mnt/user/scripts/health-check.sh

echo "=== Unraid Health Check ==="
echo ""

echo "1. Array Status:"
mdcmd status | grep mdState

echo ""
echo "2. Running Containers:"
docker ps --format "table {{.Names}}\t{{.Status}}"

echo ""
echo "3. Disk Usage:"
df -h | grep -E "cache|disk1|Filesystem"

echo ""
echo "4. Network Connectivity:"
ping -c 2 192.168.68.1 >/dev/null 2>&1 && echo "  Router: ✅ OK" || echo "  Router: ❌ FAIL"
ping -c 2 8.8.8.8 >/dev/null 2>&1 && echo "  Internet: ✅ OK" || echo "  Internet: ❌ FAIL"
ping -c 2 192.168.68.61 >/dev/null 2>&1 && echo "  Pi-hole: ✅ OK" || echo "  Pi-hole: ❌ FAIL"

echo ""
echo "5. Critical Services:"
curl -sf --max-time 5 http://localhost:4743 >/dev/null && echo "  Vaultwarden: ✅ OK" || echo "  Vaultwarden: ❌ DOWN"
curl -sf --max-time 5 http://localhost:3002 >/dev/null && echo "  Gitea: ✅ OK" || echo "  Gitea: ❌ DOWN"
curl -sf --max-time 5 http://localhost:7818 >/dev/null && echo "  NPM: ✅ OK" || echo "  NPM: ❌ DOWN"

echo ""
echo "=== Health Check Complete ==="

Run: bash /mnt/user/scripts/health-check.sh
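
For scheduled runs, a script that only prints results is hard to alert on. One possible variation, sketched below, counts failures so the exit status can drive a notification. The check function and the FAILURES counter are hypothetical names, not part of the script above.

```shell
# Number of failed checks so far.
FAILURES=0

# check LABEL COMMAND [ARGS...]
# Runs COMMAND quietly, prints one status line, and counts failures.
# Call it directly (not inside $(...)), or the counter will not update.
check() {
    local label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "  $label: OK"
    else
        echo "  $label: FAIL"
        FAILURES=$((FAILURES + 1))
    fi
}

# Example (commented out):
# check "Router" ping -c 2 192.168.68.1
# check "Pi-hole" ping -c 2 192.168.68.61
# check "Vaultwarden" curl -sf --max-time 5 http://localhost:4743
# [ "$FAILURES" -eq 0 ]   # non-zero exit status if anything failed
```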

📞 Getting Help

Pre-flight Checks

Before asking for help, gather this information:

System Diagnostics
- Unraid WebGUI: Tools → Diagnostics → Download
- Creates ZIP with all logs

Container Logs

docker logs <container_name> > container-logs.txt

Network Configuration

ip addr show > network-config.txt
ip route show >> network-config.txt

Disk Status

smartctl -a /dev/sdb > disk-smart.txt
smartctl -a /dev/sdc >> disk-smart.txt

Community Resources

Unraid Forums: https://forums.unraid.net/
- Post diagnostics ZIP
- Be specific about symptoms
- Include what you've tried
r/unraid: https://reddit.com/r/unraid
- Quick questions
- Share diagnostics in pastebin
Discord: Unraid Official Discord
- Real-time help
- Active community

Emergency Contacts

ISP Support: [Your ISP Phone Number]
Unraid License: [Store in secure location]
USB Backup Location: [Document where stored]
Off-site Backup: [If applicable]

🎓 Post-Recovery Checklist

After restoring from disaster:

[ ] Unraid array started successfully
[ ] All critical services running
    [ ] NginxProxyManager
    [ ] Cloudflared  
    [ ] Vaultwarden
    [ ] Gitea
[ ] Network connectivity verified
    [ ] Can access Unraid WebUI
    [ ] Can ping router (192.168.68.1)
    [ ] Internet working
    [ ] DNS resolving (Pi-hole)
[ ] Vaultwarden accessible (test password retrieval)
[ ] Gitea accessible (verify repositories intact)
[ ] NPM routing working (test reverse proxy)
[ ] Monitoring stack restarted
    [ ] Grafana
    [ ] InfluxDB
    [ ] Telegraf
[ ] External access working
    [ ] Tailscale connected
    [ ] Cloudflare tunnel active
[ ] Backups verified and up-to-date
[ ] Documentation updated with lessons learned
[ ] Incident documented in change log (Gitea)

🔒 Security After Recovery

Immediately After Disaster Recovery:

Change Passwords (if compromise suspected)

[ ] Unraid root password
[ ] Vaultwarden master password
[ ] Container admin passwords
[ ] Pi-hole admin password
[ ] PiKVM password

Review Access Logs

# Check SSH attempts
grep "Failed password" /var/log/syslog | tail -50  # Unraid logs sshd to syslog

# Check NPM access
docker logs NginxProxyManager | grep -i error

# Check Gitea access
docker logs Gitea | grep -i login
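
A raw list of failed logins is hard to scan. As a sketch, the lines can be grouped by source address to show who is knocking most often. The function name failed_ssh_by_ip is hypothetical, and the parsing assumes the usual OpenSSH "Failed password for ... from ADDRESS port ..." message format.

```shell
# failed_ssh_by_ip
# Reads sshd log lines on stdin and prints "COUNT ADDRESS" per source
# address, most frequent first. Only "Failed password" lines are counted.
failed_ssh_by_ip() {
    grep 'Failed password' \
        | awk '{ for (i = 1; i < NF; i++) if ($i == "from") { print $(i + 1); break } }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example (commented out; pass whichever log file holds your sshd messages):
# failed_ssh_by_ip < /path/to/logfile | head -10
```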

Verify Firewall Rules

iptables -L -n -v

Check for Unauthorized Changes

# Review Docker containers
docker ps -a

# Check cron jobs
crontab -l

# Review network interfaces
ip addr show

📝 Documentation Updates After Incident

What to Document:

What Happened:
- Date/time of incident
- Symptoms observed
- Root cause (if determined)
- Duration of outage
What You Did:
- Steps taken to recover
- What worked / didn't work
- Resources used (forums, docs, etc.)
- Time to recovery
Lessons Learned:
- What could prevent this in future
- Process improvements needed
- Documentation gaps discovered
- Backup improvements needed
Action Items:
- Backups to implement/improve
- Monitoring to add
- Scripts to create
- Hardware to replace/upgrade

Where to Document:

Create incident report: docs/incidents/YYYY-MM-DD-incident-name.md
Update this quick-start guide with new procedures
Add to troubleshooting section if recurring issue
Commit to Gitea with detailed message
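
To lower the effort of writing a report while the details are fresh, the file can be scaffolded from the checklist above. This is a sketch only: the function name new_incident is hypothetical, and the "##" heading level inside the template is an assumption about how you want the report laid out.

```shell
# new_incident SLUG [DOCS_DIR]
# Hypothetical helper: creates DOCS_DIR/YYYY-MM-DD-SLUG.md with the
# headings from "What to Document" and prints the new file's path.
# Refuses to overwrite an existing report.
new_incident() {
    local slug=$1 dir=${2:-docs/incidents} file
    file="$dir/$(date +%F)-$slug.md"
    mkdir -p "$dir" || return 1
    if [ -e "$file" ]; then
        echo "Refusing to overwrite $file" >&2
        return 1
    fi
    cat > "$file" <<'EOF'
## What Happened
- Date/time of incident:
- Symptoms observed:
- Root cause (if determined):
- Duration of outage:

## What You Did
- Steps taken to recover:
- What worked / didn't work:
- Resources used (forums, docs, etc.):
- Time to recovery:

## Lessons Learned
-

## Action Items
-
EOF
    echo "$file"
}

# Example (commented out), run from the root of the documentation repo:
# new_incident usb-boot-failure
```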

🚀 Normal Startup Sequence

From Cold Boot:

1. Power on server
   ↓
2. BIOS POST (~30 seconds)
   - Hardware check
   - Memory test
   - Drive detection
   ↓
3. Unraid boots from USB (~1-2 minutes)
   - Linux kernel loads
   - Unraid OS starts
   ↓
4. Network initializes
   - br0 interface up
   - Gets IP: 192.168.68.51
   ↓
5. Array auto-starts (if configured)
   - Parity disk: sdb
   - Data disk: sdc
   - Cache: nvme1n1p1
   ↓
6. Docker service starts
   - docker0 bridge created
   - Networks initialized
   ↓
7. Containers auto-start (if enabled)
   - Infrastructure services first
   - Then application services
   ↓
8. Services available (~3-5 minutes total)
   ✅ Ready to use!

Expected Boot Time: 3-5 minutes
If Taking Longer: Check system log for errors

🎯 Quick Health Check Command

Run After Any Restart:

# Quick one-liner health check (each check runs even if an earlier one fails)
docker ps --format "table {{.Names}}\t{{.Status}}"; \
df -h | grep -E "cache|disk1"; \
ping -c 2 192.168.68.1 >/dev/null && echo "Network: OK" || echo "Network: FAIL"

Network Issues: See network-map.md
Service Details: See service-inventory.md
Container Configs: See docker-compose/ (when created)
Main Overview: See README.md

🆘 True Emergency - Complete System Down

If everything is down and you need immediate help:

Access via PiKVM
- https://192.168.68.53
- Get console access
- View what's happening
Check Physical Server
- Power LED on?
- Fans spinning?
- Drives spinning up?
- Network activity lights?
Try Safe Mode Boot
- Boot Unraid in Safe Mode (no plugins; select it from the boot menu)
- Diagnose from console
Community Help
- Unraid Discord (fastest response)
- Forums with diagnostics ZIP
- r/unraid for quick questions
Document Everything
- Take photos/screenshots via PiKVM
- Note exact error messages
- Record what you tried
- Timeline of events

💡 Pro Tips

Test Your Backups
- Restore test annually
- Verify data integrity
- Practice recovery procedures
Keep This Guide Accessible
- Save offline copy to phone/laptop
- Print critical sections
- Bookmark in browser
Automate Where Possible
- Schedule backup scripts
- Set up monitoring alerts
- Use User Scripts plugin
Document As You Go
- Update after fixing issues
- Add new procedures discovered
- Note what worked/didn't work

Last Updated: October 31, 2025
Next Review: Quarterly or after incidents
Maintained By: Weston

Remember: Most issues are recoverable. Stay calm, work methodically, document your steps, and don't hesitate to ask for help!

Keep this guide accessible even when the server is down!
💡 Pro Tip: Save a copy to your phone/laptop/OneDrive!

🚀 You've got this!
