homelab/docs/quick-start.md
weston 6cbee11482 Phase 1 Complete: Foundation documentation
Added comprehensive homelab documentation:

README.md:
- Hardware inventory and specifications
- Network architecture overview
- Running services catalog
- Quick reference commands
- Project goals and roadmap

docs/network-map.md:
- All device IP assignments
- Port reference guide
- DNS configuration (Pi-hole + Unbound)
- Remote access setup (Tailscale + Cloudflare)
- Troubleshooting commands

docs/service-inventory.md:
- All 32 Docker containers cataloged
- Running services analysis (6 containers)
- Stopped services review (26 containers)
- Resource usage and recommendations
- Container decision matrix
- Cleanup plan to free 40GB
- Security recommendations
- Prioritized action plan

docs/quick-start.md:
- Emergency recovery procedures
- Service restart sequences
- Backup/restore guides with scripts
- Troubleshooting by scenario
- Health check automation
- Post-recovery checklist
- Common problem solutions

This establishes the foundation for all future homelab projects.
Phase 1 documentation complete! 🎉
2025-11-01 00:42:34 +01:00


# 🚀 Quick Start & Emergency Recovery Guide
**Purpose:** Get your homelab back online quickly after disaster
**Target Time:** 30-60 minutes to basic functionality
**Last Updated:** October 31, 2025
---
## 🎯 Quick Access Reference
### Essential URLs
| Service | URL | Default Credentials |
|---------|-----|---------------------|
| **Unraid Dashboard** | http://192.168.68.51 | root / (your password) |
| **Gitea** | https://gitea.segelschiff.app | Weston / (your password) |
| **Vaultwarden** | http://192.168.68.51:4743 | Master password |
| **NPM Admin** | http://192.168.68.51:7818 | admin@example.com / changeme (first login) |
| **Pi-hole** | http://192.168.68.61/admin | (your password) |
| **PiKVM** | https://192.168.68.53 | admin / admin (default) |
### SSH Access
```bash
# Local network
ssh root@192.168.68.51
# Via Tailscale (from anywhere)
ssh root@100.122.220.126
# Emergency: Use PiKVM for console access
# https://192.168.68.53
```
---
## 🆘 Emergency Recovery Scenarios
### Scenario 1: Server Won't Boot 🚨
**Symptoms:**
- No network connectivity to 192.168.68.51
- Unraid WebUI unreachable
- No response to ping
**Recovery Steps:**
1. **Physical Check** (via PiKVM or in person)
```
[ ] Server has power (check LED)
[ ] Network cable connected to eth0
[ ] Monitor shows output (via PiKVM)
[ ] USB boot drive is present and detected
```
2. **Use PiKVM for Remote Console**
- Access: https://192.168.68.53
- Login: admin / admin
- View boot process
- Check BIOS/boot messages
3. **Common Boot Issues**
**USB Boot Drive Failure** (Most common!)
```
Symptoms: "Boot device not found" or similar
Fix:
1. Have backup USB ready
2. Shut down server (via PiKVM power control)
3. Replace USB boot drive
4. Power on
5. Restore configuration from backup
```
**BIOS Settings Changed**
```
Fix:
1. Enter BIOS (DEL/F2 during boot)
2. Load defaults
3. Verify boot order (USB first)
4. Save and exit
```
**Hardware Failure**
```
Check:
1. RAM seated properly
2. All drives detected in BIOS
3. CPU fan spinning
4. No error beeps
```
4. **Boot from Backup USB**
```
Steps:
1. Power off server
2. Insert backup USB boot drive
3. Power on
4. Verify boot successful
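5. Restore configuration (if the backup USB is out of date):
- Power off and extract the latest flash backup ZIP onto the USB from another computer (see "USB Flash Backup" section)
- Boot again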
```
**Prevention:**
- ✅ Keep USB flash backup updated (weekly)
- ✅ Store backup USB in safe location
- ✅ Document BIOS settings (screenshots via PiKVM)
---
### Scenario 2: Lost Admin Password
**Unraid Root Password Reset:**
1. **Via PiKVM Console**
```
1. Access PiKVM: https://192.168.68.53
2. View console in browser
3. Wait for login prompt
4. Press Ctrl+Alt+F2 (via PiKVM keyboard)
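5. At terminal (after logging in as root): passwd root
   Note: if the console asks for the password you lost, this method cannot work; use the official Unraid password reset procedure, which edits the USB flash drive from another computer.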
6. Enter new password twice
7. Press Ctrl+Alt+F1 to return to GUI
8. Update documentation
```
2. **Via Physical Access**
```
1. Connect monitor and keyboard to server
2. Press Ctrl+Alt+F2
3. Run: passwd root (only possible from a logged-in root session)
4. Set new password
5. Press Ctrl+Alt+F1
```
**Container Passwords:**
- Check `/mnt/user/appdata/<service>/config`
- Review environment variables in Docker templates
- Use Vaultwarden if accessible
- Check this documentation repo in Gitea
---
### Scenario 3: Container Won't Start
**Quick Diagnosis:**
```bash
# Check container status
docker ps -a | grep <container_name>
# View recent logs
docker logs --tail 100 <container_name>
# Look for errors
docker inspect <container_name> | grep -i error
```
**Common Fixes:**
**Port Conflict:**
```bash
# Find what's using the port
netstat -tulpn | grep <port>
# Example: Port 3000 already in use
netstat -tulpn | grep 3000
# Stop conflicting service
docker stop <conflicting_container>
```
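`netstat` usually reports `docker-proxy` as the owner of a published port, which does not say which container is behind it. The sketch below is one way to map a host port back to a container name. The function name `containers_on_port` is hypothetical, and it assumes the tab-separated output of `docker ps --format '{{.Names}}\t{{.Ports}}'` on stdin. Port ranges such as `3000-3005` are not handled.

```shell
#!/bin/bash
# Hypothetical helper: print the names of containers that publish a given host port.
# Expects "name<TAB>ports" lines on stdin, as produced by:
#   docker ps --format '{{.Names}}\t{{.Ports}}'
containers_on_port() {
    local port="$1"
    # Match ":<port>->" so that 3000 does not also match 13000 or 30000,
    # and so a container-side port ("->3000/tcp") is ignored.
    awk -F'\t' -v p=":${port}->" 'index($2, p) { print $1 }'
}
```

Typical use: `docker ps --format '{{.Names}}\t{{.Ports}}' | containers_on_port 3000`.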
**Volume Permission Issues:**
```bash
# Check ownership
ls -la /mnt/user/appdata/<container_name>
# Fix permissions (Unraid standard: 99:100)
chown -R 99:100 /mnt/user/appdata/<container_name>
# Example: Fix Vaultwarden
chown -R 99:100 /mnt/user/appdata/vaultwarden
```
**Dependency Missing:**
```bash
# Example: Guacamole needs MariaDB
docker start mariadb
sleep 10 # Wait for database initialization
docker start ApacheGuacamole
# Verify dependency is running
docker ps | grep mariadb
```
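A fixed `sleep 10` is a guess: it can be too short on a busy server and wasteful otherwise. A minimal sketch of polling instead, assuming a hypothetical helper named `wait_until_running`. It only confirms that Docker reports the container as running; it does not prove the database inside is accepting connections yet.

```shell
#!/bin/bash
# Hypothetical helper: poll until a container reports State.Running=true,
# or give up after a timeout in seconds (default 30).
# Returns 0 once running, 1 on timeout.
wait_until_running() {
    local name="$1" timeout="${2:-30}" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if [ "$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Typical use: `docker start mariadb && wait_until_running mariadb 60 && docker start ApacheGuacamole`.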
**Resource Exhaustion:**
```bash
# Check cache usage
df -h /mnt/cache
# If cache full (>90%), clean up
# ⚠️ Removes ALL stopped containers, unused networks, build cache,
# and every image not used by a running container!
docker system prune -a
# Or free space manually
# See service-inventory.md for cleanup recommendations
```
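The ">90%" rule above can be checked without reading `df` output by eye. This is a small sketch using two hypothetical helpers, `usage_pct` and `over_threshold`. It assumes the POSIX output format of `df -P` and a filesystem name without spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the used-space percentage (integer, no % sign)
# of the filesystem that holds the given path.
usage_pct() {
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Hypothetical helper: succeed (exit 0) when usage is at or above the limit.
over_threshold() {
    local path="$1" limit="${2:-90}"
    [ "$(usage_pct "$path")" -ge "$limit" ]
}
```

Typical use: `over_threshold /mnt/cache 90 && echo "Cache above 90%: review before pruning"`.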
---
### Scenario 4: Network Connectivity Issues
**Can't Access from LAN:**
```bash
# SSH into Unraid (via PiKVM if network down)
ssh root@192.168.68.51
# Check if br0 is up
ip addr show br0
# Should show: 192.168.68.51/22
# Verify IP and routes
ip route | grep default
# Should show: default via 192.168.68.1
# Test router connectivity
ping -c 3 192.168.68.1
# Test internet
ping -c 3 8.8.8.8
# Test DNS (Pi-hole)
nslookup google.com 192.168.68.61
```
**Fix Network Issues:**
```bash
# Restart networking (from console/PiKVM)
/etc/rc.d/rc.inet1 restart
# If that doesn't work, reboot
reboot
```
**Can't Access Containers:**
```bash
# Check Docker network
docker network inspect bridge
# Verify container IP
docker inspect <container_name> | grep IPAddress
# Test from Unraid host
curl http://172.17.0.5:8080 # Example: open-webui
# Test port mapping
curl http://192.168.68.51:3000 # Should reach open-webui
```
**DNS Not Resolving:**
```bash
# Test Pi-hole directly
nslookup google.com 192.168.68.61
# If Pi-hole down, check Pi Zero
ping 192.168.68.61
# SSH to Pi-hole
ssh pi@192.168.68.61
# Check Pi-hole status
pihole status
# Restart if needed
pihole restartdns
```
---
### Scenario 5: Array Won't Start
**Symptoms:**
- Unraid GUI accessible but array shows "Stopped"
- Disks show errors or missing
**Troubleshooting:**
```bash
# Check disk health
smartctl -a /dev/sdb # Parity
smartctl -a /dev/sdc # Disk 1
# View disk assignments
cat /boot/config/disk.cfg
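# Check for filesystem errors (read-only check; start the array in Maintenance mode first)
# Device is /dev/md1p1 on Unraid 6.12+; older releases use /dev/md1
xfs_repair -n /dev/md1p1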
```
**Common Causes:**
- Parity sync in progress (wait for completion)
- Disk failed (check SMART, may need replacement)
- Unclean shutdown (filesystem check required)
- Disk assignment changed
**Recovery:**
1. **Start Array in Maintenance Mode**
- Tick "Maintenance mode" next to the Start button in the Unraid GUI
- Click "Start"
- Run a filesystem check on the affected disk (click the disk name on the Main tab)
2. **Review Logs**
- Tools → System Log
- Look for disk errors
- Check for power events
3. **If Disk Failed**
- Follow Unraid disk replacement procedure
- Do NOT format or write to disk unnecessarily
- Seek help in Unraid forums if uncertain
---
## 🔧 Critical Service Restart Procedures
### Restart Core Services (Proper Order)
**1. Infrastructure First:**
```bash
# Start reverse proxy (for routing)
docker start NginxProxyManager
# Wait for it to be ready
sleep 5
docker ps | grep NginxProxyManager
# Start tunnel (for remote access)
docker start Cloudflared
# Verify both running
docker ps | grep -E "NginxProxyManager|Cloudflared"
```
**2. Security Services:**
```bash
# Password manager (critical!)
docker start vaultwarden
# Wait for healthy status
sleep 10
docker ps | grep vaultwarden
# Should show "(healthy)"
# If not healthy, check logs
docker logs --tail 50 vaultwarden
```
**3. Development Tools:**
```bash
# Git server
docker start Gitea
# Wait for initialization
sleep 5
# Remote access gateway
docker start ApacheGuacamole
# Note: Needs MariaDB if configured
```
**4. Monitoring (IMPORTANT!):**
```bash
# Database first
docker start Influxdb
# Wait for DB to initialize
sleep 15
# Then metrics collector
docker start Telegraf
# Finally visualization
docker start Grafana
# Verify all running
docker ps | grep -E "Influxdb|Telegraf|Grafana"
```
**5. Optional Services:**
```bash
# LLM backend
docker start ollama
sleep 10
# LLM interface
docker start open-webui
# Wait for healthy
docker ps | grep open-webui
```
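The five stages above can be chained into one ordered start that waits on each container instead of sleeping for a fixed time. The sketch below is only an illustration: `container_state`, `start_in_order`, and the `START_TIMEOUT` variable are hypothetical names. It treats a container as ready when Docker reports `healthy` (image has a healthcheck) or `running` (no healthcheck).

```shell
#!/bin/bash
# Hypothetical helper: print the health status if the container defines a
# healthcheck, otherwise the plain container state (running, exited, ...).
container_state() {
    docker inspect -f '{{if .State.Health}}{{.State.Health.Status}}{{else}}{{.State.Status}}{{end}}' "$1" 2>/dev/null
}

# Start containers in the order given. Stops at the first container that does
# not reach "healthy" or "running" within START_TIMEOUT seconds (default 60).
start_in_order() {
    local timeout="${START_TIMEOUT:-60}" name waited state
    for name in "$@"; do
        docker start "$name" >/dev/null || return 1
        waited=0
        while :; do
            state="$(container_state "$name")"
            case "$state" in
                healthy|running) echo "$name: $state"; break ;;
            esac
            if [ "$waited" -ge "$timeout" ]; then
                echo "$name: timed out ($state)" >&2
                return 1
            fi
            sleep 1
            waited=$((waited + 1))
        done
    done
}
```

Typical use, following the order in this section: `start_in_order NginxProxyManager Cloudflared vaultwarden Gitea Influxdb Telegraf Grafana`.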
---
### Stop All Services Gracefully
```bash
# Stop all running containers
docker stop $(docker ps -q)
# Verify all stopped
docker ps
# Should show empty output
# Wait before stopping array
sleep 5
# Stop array (from GUI)
# Main → Array Operation → Stop
```
---
## 📦 Backup & Restore Procedures
### USB Flash Backup (Unraid Configuration)
**Create Backup:**
1. Navigate to: **Main → Flash → Flash Backup**
2. Click "Backup Now"
3. Download ZIP file (e.g., `unraid-flash-backup-20251031.zip`)
4. Store securely OFF-SERVER:
- OneDrive: `/z_Unraid/Backups/`
- External drive
- Cloud storage
**Restore from Backup:**
```
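1. Format new USB drive as FAT32 with the volume label UNRAID
2. Extract the backup ZIP to the root of the USB
- config/ directory
- bzimage, bzroot, etc.
3. Run the make_bootable script from the USB (needed for legacy BIOS boot)
4. Safely eject USB
5. Boot from new USB
6. Configuration loads from config/; a replacement USB has a new GUID, so the license key must be transferred to it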
```
**Frequency:**
- Weekly minimum
- After ANY configuration change
- Before major updates
---
### Container Data Backup
**Critical Directories:**
```
Priority 1 (CRITICAL):
/mnt/user/appdata/vaultwarden/ 🚨 Your passwords!
/mnt/user/appdata/gitea/ 🚨 Your code repositories!
Priority 2 (Important):
/mnt/user/appdata/NginxProxyManager/ Proxy configs
/mnt/user/appdata/Grafana/ Dashboards
/mnt/user/appdata/Influxdb/ Metrics history
Priority 3 (Optional):
/mnt/user/appdata/open-webui/ LLM chat history
```
**Quick Backup Script:**
```bash
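#!/bin/bash
# Save as: /mnt/user/scripts/backup-critical.sh
# Stops the critical containers, archives their appdata, then restarts them.
set -u
CONTAINERS="vaultwarden Gitea NginxProxyManager"
BACKUP_DIR="/mnt/user/backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR" || exit 1
# Restart the containers on every exit path, including a failed step
trap 'echo "Restarting containers..."; docker start $CONTAINERS' EXIT
echo "Stopping containers..."
docker stop $CONTAINERS
echo "Backing up data..."
status=0
tar -czf "$BACKUP_DIR/vaultwarden.tar.gz" /mnt/user/appdata/vaultwarden || status=1
tar -czf "$BACKUP_DIR/gitea.tar.gz" /mnt/user/appdata/gitea || status=1
tar -czf "$BACKUP_DIR/npm.tar.gz" /mnt/user/appdata/NginxProxyManager || status=1
ls -lh "$BACKUP_DIR"
if [ "$status" -eq 0 ]; then
    echo "✅ Backup complete: $BACKUP_DIR"
else
    echo "❌ Backup finished with errors: $BACKUP_DIR"
fi
exit "$status"
```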
**Make Executable:**
```bash
chmod +x /mnt/user/scripts/backup-critical.sh
```
**Run Manually:**
```bash
/mnt/user/scripts/backup-critical.sh
```
**Schedule (User Scripts Plugin):**
- Frequency: Daily at 2 AM
- Retention: Keep last 30 days
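
The backup script itself does not delete anything, so the 30-day retention has to be enforced separately. A minimal sketch, assuming a hypothetical helper `prune_backups` and assuming `/mnt/user/backups` contains only the timestamped directories the script creates:

```shell
#!/bin/bash
# Hypothetical helper: delete directories directly under the backup root
# that were last modified more than N days ago (default 30).
# Anything nested deeper, and any plain file in the root, is left alone.
prune_backups() {
    local root="$1" days="${2:-30}"
    [ -d "$root" ] || return 1
    find "$root" -mindepth 1 -maxdepth 1 -type d -mtime +"$days" -exec rm -rf {} +
}
```

Typical use: call `prune_backups /mnt/user/backups 30` after a successful backup run. Running the same `find` with `-print` instead of `-exec rm -rf {} +` first shows what would be removed.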
---
**Restore from Backup:**
```bash
# Example: Restore Vaultwarden
docker stop vaultwarden
# Backup current (corrupted) data
mv /mnt/user/appdata/vaultwarden /mnt/user/appdata/vaultwarden.old
# Extract backup
tar -xzf /mnt/user/backups/20251031_120000/vaultwarden.tar.gz -C /
# Restart container
docker start vaultwarden
# Verify working
curl http://192.168.68.51:4743
```
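Before moving the live data aside, it is worth confirming that the archive can actually be read. The sketch below uses a hypothetical helper `verify_archive`; it checks that the file is non-empty and that `tar` can list it end to end. It does not check that the contents are the right version of your data.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the archive exists, is non-empty,
# and tar can list every entry of the gzip-compressed archive.
verify_archive() {
    local archive="$1"
    [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }
    tar -tzf "$archive" >/dev/null 2>&1 || { echo "unreadable: $archive" >&2; return 1; }
}
```

Typical use: `verify_archive /mnt/user/backups/20251031_120000/vaultwarden.tar.gz && docker stop vaultwarden`.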
---
## ⚡ Quick Commands Reference
### System Status
```bash
# System uptime and load
uptime
# Resource usage
free -h
df -h
# Array status (/proc/mdcmd is the command interface; status is read from /proc/mdstat)
cat /proc/mdstat
# Docker container summary
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# Temperature (if sensors installed)
sensors
# Disk health quick check
smartctl -H /dev/sdb # Parity
smartctl -H /dev/sdc # Disk 1
```
### Docker Quick Commands
```bash
# Start every container, including ones stopped on purpose (see service-inventory.md)
docker start $(docker ps -aq)
# Stop all running containers
docker stop $(docker ps -q)
# View logs (last 50 lines)
docker logs --tail 50 <container_name>
# Follow logs in real-time
docker logs -f <container_name>
# Restart container
docker restart <container_name>
# Remove container (⚠️ will lose non-volume data!)
docker rm <container_name>
# Clean up unused resources
docker system prune # ⚠️ Removes ALL stopped containers, unused networks, dangling images
docker system prune -a # ⚠️ Also removes every image not used by a container!
docker system prune --volumes # ⚠️ Also removes unused volumes!
```
### Network Diagnostics
```bash
# Check all interfaces
ip addr show
# Test key infrastructure
ping -c 3 192.168.68.1 # Router
ping -c 3 192.168.68.51 # Unraid
ping -c 3 192.168.68.61 # Pi-hole
ping -c 3 8.8.8.8 # Internet
# DNS resolution test
nslookup google.com
nslookup google.com 192.168.68.61 # Test Pi-hole specifically
# Check listening ports
netstat -tulpn | grep LISTEN
# Test specific port
nc -zv 192.168.68.51 3002 # Example: Gitea
curl -I http://192.168.68.51:3002 # HTTP test
```
### Quick Health Check Script
```bash
#!/bin/bash
# Save as: /mnt/user/scripts/health-check.sh
echo "=== Unraid Health Check ==="
echo ""
echo "1. Array Status:"
grep mdState /proc/mdstat
echo ""
echo "2. Running Containers:"
docker ps --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "3. Disk Usage:"
df -h | grep -E "cache|disk1|Filesystem"
echo ""
echo "4. Network Connectivity:"
ping -c 2 192.168.68.1 >/dev/null 2>&1 && echo " Router: ✅ OK" || echo " Router: ❌ FAIL"
ping -c 2 8.8.8.8 >/dev/null 2>&1 && echo " Internet: ✅ OK" || echo " Internet: ❌ FAIL"
ping -c 2 192.168.68.61 >/dev/null 2>&1 && echo " Pi-hole: ✅ OK" || echo " Pi-hole: ❌ FAIL"
echo ""
echo "5. Critical Services:"
curl -s --max-time 5 http://localhost:4743 >/dev/null && echo " Vaultwarden: ✅ OK" || echo " Vaultwarden: ❌ DOWN"
curl -s --max-time 5 http://localhost:3002 >/dev/null && echo " Gitea: ✅ OK" || echo " Gitea: ❌ DOWN"
curl -s --max-time 5 http://localhost:7818 >/dev/null && echo " NPM: ✅ OK" || echo " NPM: ❌ DOWN"
echo ""
echo "=== Health Check Complete ==="
```
**Run:** `bash /mnt/user/scripts/health-check.sh`
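
For scheduled runs, a script that only prints ✅ and ❌ is hard to act on; an exit code is easier to hook into alerts. A minimal sketch of that idea, assuming a hypothetical `check` wrapper and a `failures` counter (neither is part of the script above):

```shell
#!/bin/bash
# Hypothetical wrapper: run a command silently, print one status line,
# and count how many checks failed.
failures=0
check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        echo "  $label: OK"
    else
        echo "  $label: FAIL"
        failures=$((failures + 1))
    fi
}

# Example calls (commented out so that sourcing this file has no side effects):
#   check "Router"      ping -c 2 192.168.68.1
#   check "Vaultwarden" curl -s --max-time 5 http://localhost:4743
#   exit "$failures"   # non-zero exit when any check failed
```

A scheduler such as the User Scripts plugin can then react to a non-zero exit instead of parsing the output.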
---
## 📞 Getting Help
### Pre-flight Checks
Before asking for help, gather this information:
1. **System Diagnostics**
- Unraid WebGUI: Tools → Diagnostics → Download
- Creates ZIP with all logs
2. **Container Logs**
```bash
docker logs <container_name> > container-logs.txt
```
3. **Network Configuration**
```bash
ip addr show > network-config.txt
ip route show >> network-config.txt
```
4. **Disk Status**
```bash
smartctl -a /dev/sdb > disk-smart.txt
smartctl -a /dev/sdc >> disk-smart.txt
```
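
The individual commands above leave several loose text files behind. One possible way to gather them into a single folder is sketched below; `capture` is a hypothetical helper, and the output directory in the example is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: run a command and save stdout+stderr to <dir>/<name>.txt.
# A failing command is recorded in the file rather than aborting the collection.
capture() {
    local dir="$1" name="$2"
    shift 2
    mkdir -p "$dir"
    "$@" > "$dir/$name.txt" 2>&1 || echo "(command exited with status $?)" >> "$dir/$name.txt"
}

# Example calls (commented out so that sourcing this file has no side effects):
#   out="/tmp/diag-$(date +%Y%m%d_%H%M%S)"
#   capture "$out" network   ip addr show
#   capture "$out" routes    ip route show
#   capture "$out" smart-sdb smartctl -a /dev/sdb
#   tar -czf "$out.tar.gz" -C "$(dirname "$out")" "$(basename "$out")"
```

Review the files before posting them publicly; `ip addr` output and container logs can contain addresses or tokens you may not want to share.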
### Community Resources
- **Unraid Forums:** https://forums.unraid.net/
- Post diagnostics ZIP
- Be specific about symptoms
- Include what you've tried
- **r/unraid:** https://reddit.com/r/unraid
- Quick questions
- Share diagnostics in pastebin
- **Discord:** Unraid Official Discord
- Real-time help
- Active community
### Emergency Contacts
```
ISP Support: [Your ISP Phone Number]
Unraid License: [Store in secure location]
USB Backup Location: [Document where stored]
Off-site Backup: [If applicable]
```
---
## 🎓 Post-Recovery Checklist
After restoring from disaster:
```
[ ] Unraid array started successfully
[ ] All critical services running
[ ] NginxProxyManager
[ ] Cloudflared
[ ] Vaultwarden
[ ] Gitea
[ ] Network connectivity verified
[ ] Can access Unraid WebUI
[ ] Can ping router (192.168.68.1)
[ ] Internet working
[ ] DNS resolving (Pi-hole)
[ ] Vaultwarden accessible (test password retrieval)
[ ] Gitea accessible (verify repositories intact)
[ ] NPM routing working (test reverse proxy)
[ ] Monitoring stack restarted
[ ] Grafana
[ ] InfluxDB
[ ] Telegraf
[ ] External access working
[ ] Tailscale connected
[ ] Cloudflare tunnel active
[ ] Backups verified and up-to-date
[ ] Documentation updated with lessons learned
[ ] Incident documented in change log (Gitea)
```
---
## 🔒 Security After Recovery
**Immediately After Disaster Recovery:**
1. **Change Passwords** (if compromise suspected)
```
[ ] Unraid root password
[ ] Vaultwarden master password
[ ] Container admin passwords
[ ] Pi-hole admin password
[ ] PiKVM password
```
2. **Review Access Logs**
```bash
# Check SSH attempts (Unraid logs sshd to /var/log/syslog, not auth.log)
grep "Failed password" /var/log/syslog | tail -50
# Check NPM access
docker logs NginxProxyManager | grep -i error
# Check Gitea access
docker logs Gitea | grep -i login
```
3. **Verify Firewall Rules**
```bash
iptables -L -n -v
```
4. **Check for Unauthorized Changes**
```bash
# Review Docker containers
docker ps -a
# Check cron jobs
crontab -l
# Review network interfaces
ip addr show
```
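
`docker ps -a` only helps if you remember what should be there. A sketch of comparing against a saved baseline follows; the helper name `unexpected_containers` and the baseline file are hypothetical, and the baseline is assumed to hold one container name per line.

```shell
#!/bin/bash
# Hypothetical helper: print container names that appear on stdin but not in
# the baseline file. Feed it the output of:
#   docker ps -a --format '{{.Names}}'
unexpected_containers() {
    local baseline="$1"
    # -v invert, -x whole-line match, -F fixed strings, -f patterns from file.
    # grep exits 1 when nothing is printed, which is the good case here.
    grep -vxFf "$baseline" || true
}
```

Typical use: save a baseline while the system is known-good with `docker ps -a --format '{{.Names}}' > baseline.txt`, keep a copy off the server, and later run `docker ps -a --format '{{.Names}}' | unexpected_containers baseline.txt`. Empty output means no unknown containers.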
---
## 📝 Documentation Updates After Incident
**What to Document:**
1. **What Happened:**
- Date/time of incident
- Symptoms observed
- Root cause (if determined)
- Duration of outage
2. **What You Did:**
- Steps taken to recover
- What worked / didn't work
- Resources used (forums, docs, etc.)
- Time to recovery
3. **Lessons Learned:**
- What could prevent this in future
- Process improvements needed
- Documentation gaps discovered
- Backup improvements needed
4. **Action Items:**
- Backups to implement/improve
- Monitoring to add
- Scripts to create
- Hardware to replace/upgrade
**Where to Document:**
- Create incident report: `docs/incidents/YYYY-MM-DD-incident-name.md`
- Update this quick-start guide with new procedures
- Add to troubleshooting section if recurring issue
- Commit to Gitea with detailed message
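
Starting from a template makes it more likely the report gets written while details are fresh. A minimal sketch, assuming a hypothetical helper `new_incident` that follows the `docs/incidents/YYYY-MM-DD-incident-name.md` convention and the four headings listed above:

```shell
#!/bin/bash
# Hypothetical helper: create docs/incidents/YYYY-MM-DD-<slug>.md under the
# given repository directory, pre-filled with the four section headings.
# Prints the path of the new file; refuses to overwrite an existing report.
new_incident() {
    local repo="$1" slug="$2" file
    file="$repo/docs/incidents/$(date +%Y-%m-%d)-$slug.md"
    if [ -e "$file" ]; then
        echo "already exists: $file" >&2
        return 1
    fi
    mkdir -p "$(dirname "$file")"
    cat > "$file" <<EOF
# Incident: $slug

## What Happened

## What You Did

## Lessons Learned

## Action Items
EOF
    echo "$file"
}
```

Typical use from a clone of the documentation repo: `new_incident . usb-boot-failure`, then fill in the sections and commit.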
---
## 🚀 Normal Startup Sequence
**From Cold Boot:**
```
1. Power on server
2. BIOS POST (~30 seconds)
- Hardware check
- Memory test
- Drive detection
3. Unraid boots from USB (~1-2 minutes)
- Linux kernel loads
- Unraid OS starts
4. Network initializes
- br0 interface up
- Gets IP: 192.168.68.51
5. Array auto-starts (if configured)
- Parity disk: sdb
- Data disk: sdc
- Cache: nvme1n1p1
6. Docker service starts
- docker0 bridge created
- Networks initialized
7. Containers auto-start (if enabled)
- Infrastructure services first
- Then application services
8. Services available (~3-5 minutes total)
✅ Ready to use!
```
**Expected Boot Time:** 3-5 minutes
**If Taking Longer:** Check system log for errors
---
## 🎯 Quick Health Check Command
**Run After Any Restart:**
```bash
# Quick one-liner health check (each part reports independently)
docker ps --format "table {{.Names}}\t{{.Status}}"; \
df -h | grep -E "cache|disk1"; \
if ping -c 2 192.168.68.1 >/dev/null; then echo "Network: OK"; else echo "Network: FAIL"; fi
```
---
## 📚 Related Documentation
- **Network Issues:** See `network-map.md`
- **Service Details:** See `service-inventory.md`
- **Container Configs:** See `docker-compose/` (when created)
- **Main Overview:** See `README.md`
---
## 🆘 True Emergency - Complete System Down
**If everything is down and you need immediate help:**
1. **Access via PiKVM**
- https://192.168.68.53
- Get console access
- View what's happening
2. **Check Physical Server**
- Power LED on?
- Fans spinning?
- Drives spinning up?
- Network activity lights?
3. **Try Safe Mode Boot**
- Select a Safe Mode entry (no plugins) from the Unraid boot menu
- Diagnose from console
4. **Community Help**
- Unraid Discord (fastest response)
- Forums with diagnostics ZIP
- r/unraid for quick questions
5. **Document Everything**
- Take photos/screenshots via PiKVM
- Note exact error messages
- Record what you tried
- Timeline of events
---
## 💡 Pro Tips
1. **Test Your Backups**
- Restore test annually
- Verify data integrity
- Practice recovery procedures
2. **Keep This Guide Accessible**
- Save offline copy to phone/laptop
- Print critical sections
- Bookmark in browser
3. **Automate Where Possible**
- Schedule backup scripts
- Set up monitoring alerts
- Use User Scripts plugin
4. **Document As You Go**
- Update after fixing issues
- Add new procedures discovered
- Note what worked/didn't work
---
**Last Updated:** October 31, 2025
**Next Review:** Quarterly or after incidents
**Maintained By:** Weston
---
**Remember:** Most issues are recoverable. Stay calm, work methodically, document your steps, and don't hesitate to ask for help!
**Keep this guide accessible even when the server is down!**
💡 **Pro Tip:** Save a copy to your phone/laptop/OneDrive!
🚀 **You've got this!**