Automated S3 Backup Solution
This project is an automated backup pipeline that creates secure, versioned snapshots in AWS S3. A multi-stage process handles different kinds of directories appropriately, optimizes for performance, and prioritizes data integrity. It combines the local efficiency of rsync and git with the durable, scalable storage of the cloud to provide a disaster recovery strategy for critical development environments.
Tech Stack:
- Languages & Core Libraries: Bash
- Cloud & DevOps: AWS S3
Backup Solutions: An Overview
This summary compares the key trade-offs between four common backup archetypes, from simple local storage to a custom cloud solution.
Performance & Access
How quickly you can move data and, more importantly, get it back when you need it.
| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Bulk Data Speed | ✅ ★★★★☆ | ❌ ★☆☆☆☆ | ❌ ★☆☆☆☆ | ❌ ★★☆☆☆ |
| Incremental Speed | ⚠️ ★★☆☆☆ | ✅ ★★★★★ | ✅ ★★★★☆ | ✅ ★★★★★ |
| Ease of Restore | ✅ ★★★★★ | ✅ ★★★★★ | ❌ ★★☆☆☆ | ⚠️ ★★★☆☆ |
Economics
The financial cost of the solution, both upfront and over time.
| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Upfront Investment Barrier | ✅ ★★★★☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Operating Cost (OPEX) | ✅ ★★★★★ | ⚠️ ★★★☆☆ | ✅ ★★★★☆ | ⚠️ ★★★☆☆ |
| Marginal Cost (per GB) | ❌ ★★☆☆☆ | ⚠️ ★★★☆☆ | ✅ ★★★★★ | ✅ ★★★★★ |
Usability & Control
The balance between ease of use and the power to customize the process.
| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Ease of Setup | ✅ ★★★★★ | ✅ ★★★★☆ | ✅ ★★★★★ | ❌ ★☆☆☆☆ |
| Ease of Daily Backup | ⚠️ ★★★☆☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Configuration & Control | ❌ ★☆☆☆☆ | ⚠️ ★★★☆☆ | ❌ ★☆☆☆☆ | ✅ ★★★★★ |
Data Safety & Resilience
How well the solution protects your data from different types of loss, from disaster to human error.
| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Disaster Resilience | ❌ ★☆☆☆☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Resilience to User Error | ❌ ★☆☆☆☆ | ❌ ★☆☆☆☆ | ✅ ★★★★★ | ✅ ★★★★★ |
| File Versioning (Native) | ⚠️ ★★★☆☆ | ✅ ★★★★☆ | ✅ ★★★★★ | ✅ ★★★★★ |
Key Features
Git-Aware Incremental Snapshots
The pipeline's core strength is its intelligent local snapshotting capability. Before any data is sent to the cloud, it creates a clean, optimized staging copy of all specified source directories.
For directories that are Git repositories, it automatically respects .gitignore rules. This ensures that build artifacts, logs, and other ignored files are excluded from the backup, minimizing snapshot size and clutter.
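One way to honor .gitignore rules during staging is rsync's per-directory merge filter, which reads each .gitignore it passes and applies its patterns as exclude rules. The sketch below is a minimal illustration; the paths, and the choice to exclude the .git directory itself, are assumptions rather than the project's confirmed behavior.

```bash
#!/usr/bin/env bash
# Illustrative only: stage one Git working tree while honoring .gitignore.
set -euo pipefail

SRC="$HOME/projects/my-repo"              # assumed source directory
STAGING="$HOME/.backup-staging/my-repo"   # assumed staging location

# ':- .gitignore' is a per-directory merge filter: rsync reads each
# .gitignore it encounters and treats its patterns as exclude rules.
rsync -a --delete \
      --filter=':- .gitignore' \
      --exclude='.git/' \
      "$SRC/" "$STAGING/"
```

Excluding .git/ keeps the snapshot small; keeping it would preserve full repository history in the backup, so either choice is defensible.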
rsync is used to build the local snapshot, ensuring that only new or modified files are copied to the staging area. This makes the snapshot creation process extremely fast after the initial run.
The entire backup process is driven by simple plain-text config files, allowing directories to be added or removed without modifying the core logic.
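Putting the two previous points together, the snapshot stage could be as simple as a loop over a plain-text list of directories. The file name backup_dirs.txt, the staging path, and the git-detection check below are illustrative assumptions, not the project's actual names:

```bash
#!/usr/bin/env bash
# Hypothetical snapshot loop driven by a plain-text config file.
# backup_dirs.txt lists one absolute source directory per line.
set -euo pipefail

CONFIG="backup_dirs.txt"
STAGING="$HOME/.backup-staging"

while IFS= read -r src; do
    case "$src" in ''|'#'*) continue ;; esac   # skip blanks and comments
    name="$(basename "$src")"
    if [ -d "$src/.git" ]; then
        # Git repository: respect .gitignore via the merge filter.
        rsync -a --delete --filter=':- .gitignore' --exclude='.git/' \
              "$src/" "$STAGING/$name/"
    else
        # Plain directory: rsync still copies only new or changed files,
        # so re-runs after the first snapshot are fast.
        rsync -a --delete "$src/" "$STAGING/$name/"
    fi
done < "$CONFIG"
```

Adding or removing a backup target is then a one-line change to the config file, with no edits to the script itself.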
Optimized & Parallelized Cloud Upload
Once the local snapshot is created, the system uses a highly optimized process (aws s3 sync) to synchronize the data with an AWS S3 bucket. It minimizes data transfer by only uploading files that are new or have changed since the last backup.
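The upload step itself can stay very small, since aws s3 sync already does the change detection. The bucket name and prefix below are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical cloud-sync step; bucket and prefix are placeholders.
set -euo pipefail

STAGING="$HOME/.backup-staging"
BUCKET="s3://my-backup-bucket/snapshots"

# aws s3 sync compares size and timestamp metadata and uploads only new
# or changed files; --delete removes objects whose local counterpart is
# gone, so the bucket mirrors the snapshot exactly.
aws s3 sync "$STAGING/" "$BUCKET/" --delete
```

With bucket versioning enabled on the S3 side, objects removed by --delete remain recoverable as prior versions, which is what the versioning and user-error rows in the comparison above rely on.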
Orchestrated and Auditable Execution
The entire pipeline is managed by a master orchestration script that ensures safe, predictable, and transparent execution, complete with built-in safety checks.
- Multi-Stage Workflow: The backup is a deliberate two-step process (local snapshot -> cloud sync), which allows for verification and insulation between the live file system and the cloud destination.
- Comprehensive Logging: Every action, from directory processing to S3 uploads, is timestamped and written to dedicated log files, providing a complete audit trail for diagnostics and verification.
- Safety-First Design: The pipeline includes dedicated dry-run scripts and user confirmation prompts, allowing the user to review all proposed changes before any data is moved or deleted (see the sketch after this list).
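A minimal sketch of such an orchestrator is below, assuming two stage scripts (snapshot.sh, and s3_sync.sh with a --dry-run preview mode) and a log directory; all of these names are hypothetical stand-ins for the project's actual components.

```bash
#!/usr/bin/env bash
# Hypothetical orchestrator; script names, flags, and paths are
# illustrative stand-ins for the project's actual components.
set -euo pipefail

LOG_FILE="$HOME/backup-logs/backup-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG_FILE")"

log() {
    # Timestamp every message so the log doubles as an audit trail.
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" | tee -a "$LOG_FILE"
}

log "Stage 1/2: building local snapshot"
./snapshot.sh 2>&1 | tee -a "$LOG_FILE"

log "Previewing cloud changes (no data moved yet)"
./s3_sync.sh --dry-run 2>&1 | tee -a "$LOG_FILE"

read -r -p "Apply these changes to S3? [y/N] " answer
case "$answer" in
    [Yy])
        log "Stage 2/2: syncing snapshot to S3"
        ./s3_sync.sh 2>&1 | tee -a "$LOG_FILE"
        log "Backup complete"
        ;;
    *)
        log "Aborted by user; nothing uploaded"
        ;;
esac
```

Keeping the two stages as separate scripts behind a single entry point preserves the insulation described above: the snapshot can be inspected, and the dry-run output reviewed, before anything touches the bucket.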