Automated Backup to S3 | Overview

Jun 10, 2025

Automated S3 Backup Solution

This project is an automated backup pipeline that creates secure, versioned snapshots in AWS S3. A multi-stage process handles different types of directories intelligently, optimizes for performance, and prioritizes data integrity. It combines the local efficiency of rsync and git with the scalable, durable storage of the cloud, providing a disaster recovery strategy for critical development environments.

Tech Stack:

  • Languages & Core Libraries: Bash
  • Cloud & DevOps: AWS S3

Backup Solutions: An Overview

This summary compares the key trade-offs between four common backup archetypes, from simple local storage to a custom cloud solution.

Performance & Access

How quickly you can move data and, more importantly, get it back when you need it.

| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Bulk Data Speed | ✅ ★★★★☆ | ❌ ★☆☆☆☆ | ❌ ★☆☆☆☆ | ❌ ★★☆☆☆ |
| Incremental Speed | ⚠️ ★★☆☆☆ | ✅ ★★★★★ | ✅ ★★★★☆ | ✅ ★★★★★ |
| Ease of Restore | ✅ ★★★★★ | ✅ ★★★★★ | ❌ ★★☆☆☆ | ⚠️ ★★★☆☆ |

Economics

The financial cost of the solution, both upfront and over time.

| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Upfront Investment Barrier | ✅ ★★★★☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Operating Cost (OPEX) | ✅ ★★★★★ | ⚠️ ★★★☆☆ | ✅ ★★★★☆ | ⚠️ ★★★☆☆ |
| Marginal Cost (per GB) | ❌ ★★☆☆☆ | ⚠️ ★★★☆☆ | ✅ ★★★★★ | ✅ ★★★★★ |

Usability & Control

The balance between ease of use and the power to customize the process.

| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Ease of Setup | ✅ ★★★★★ | ✅ ★★★★☆ | ✅ ★★★★★ | ❌ ★☆☆☆☆ |
| Ease of Daily Backup | ⚠️ ★★★☆☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Configuration & Control | ❌ ★☆☆☆☆ | ⚠️ ★★★☆☆ | ❌ ★☆☆☆☆ | ✅ ★★★★★ |

Data Safety & Resilience

How well the solution protects your data from different types of loss, from disaster to human error.

| Parameter | External HDD | Sync Service (GDrive) | Backup Service (Backblaze) | My Solution (Bash + S3) |
| :--- | :--- | :--- | :--- | :--- |
| Disaster Resilience | ❌ ★☆☆☆☆ | ✅ ★★★★★ | ✅ ★★★★★ | ✅ ★★★★★ |
| Resilience to User Error | ❌ ★☆☆☆☆ | ❌ ★☆☆☆☆ | ✅ ★★★★★ | ✅ ★★★★★ |
| File Versioning (Native) | ⚠️ ★★★☆☆ | ✅ ★★★★☆ | ✅ ★★★★★ | ✅ ★★★★★ |

Key Features

Git-Aware Incremental Snapshots

The pipeline's core strength is its intelligent local snapshotting capability. Before any data is sent to the cloud, it creates a clean, optimized staging copy of all specified source directories.

For directories that are Git repositories, it automatically respects .gitignore rules. This ensures that build artifacts, logs, and other ignored files are excluded from the backup, minimizing snapshot size and clutter.

rsync builds the local snapshot, copying only new or modified files into the staging area. This makes snapshot creation extremely fast after the initial run.
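The project's own scripts aren't reproduced here, but a minimal sketch of this snapshot step might look like the following. The helper names, the staging path, and the use of rsync's `:- .gitignore` per-directory merge rule are illustrative assumptions, not code taken from the pipeline:

```bash
#!/usr/bin/env bash
set -euo pipefail

SNAPSHOT_ROOT="$HOME/.backup-staging"   # hypothetical staging location

is_git_repo() {
    # Exit 0 when the given directory sits inside a Git work tree.
    git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1
}

snapshot_dir() {
    local src="$1"
    local dst="$SNAPSHOT_ROOT/$(basename "$src")"
    local args=(-a --delete)

    if is_git_repo "$src"; then
        # ':- .gitignore' is rsync's per-directory merge rule: it reads
        # exclude patterns from every .gitignore it encounters while
        # descending, so ignored artifacts never reach the staging copy.
        args+=(--filter=':- .gitignore')
    fi

    mkdir -p "$dst"
    # -a preserves permissions and timestamps; --delete removes files that
    # vanished from the source. Unchanged files are skipped entirely,
    # which is what keeps re-runs fast.
    rsync "${args[@]}" "$src/" "$dst/"
}

snapshot_dir "$HOME/projects/my-app"   # example invocation
```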

The entire backup process is driven by simple plain-text config files, so directories can be added or removed without modifying the core logic.
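For illustration, such a config file could be as simple as one source directory per line, with a driver loop feeding each entry to the snapshot_dir helper sketched above (the file name backup_dirs.txt and the comment handling are assumptions):

```bash
# backup_dirs.txt (hypothetical): one source directory per line, e.g.
#   /home/user/projects
#   /home/user/documents

# Snapshot every directory listed in the config file.
while IFS= read -r dir; do
    [[ -z "$dir" || "$dir" == \#* ]] && continue   # skip blanks and comments
    snapshot_dir "$dir"
done < backup_dirs.txt
```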

Optimized & Parallelized Cloud Upload

Once the local snapshot is created, the system uses aws s3 sync to synchronize the staging area with an AWS S3 bucket, uploading only files that are new or have changed since the last backup.
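A hedged sketch of that sync step, assuming the staging layout above and a versioning-enabled bucket (the bucket name, prefix, and flag choices are placeholders, not the project's actual options):

```bash
SNAPSHOT_ROOT="$HOME/.backup-staging"      # staging area from the sketch above
BUCKET="s3://my-backup-bucket/snapshots"   # placeholder bucket and prefix

# aws s3 sync compares size and modification time, uploading only new or
# changed files; --delete mirrors local deletions to the bucket. With S3
# versioning enabled, overwritten and deleted objects are retained as
# prior versions rather than lost.
aws s3 sync "$SNAPSHOT_ROOT" "$BUCKET" --delete
```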

Orchestrated and Auditable Execution

The entire pipeline is managed by a master orchestration script that ensures safe, predictable, and transparent execution, complete with built-in safety checks.

  • Multi-Stage Workflow: The backup is a deliberate two-step process (local snapshot → cloud sync), which allows for verification and provides isolation between the live file system and the cloud destination.
  • Comprehensive Logging: Every action, from directory processing to S3 uploads, is timestamped and written to dedicated log files, providing a complete audit trail for diagnostics and verification.
  • Safety-First Design: The pipeline includes dedicated dry-run scripts and user confirmation prompts, so every proposed change can be reviewed before any data is moved or deleted (see the sketch after this list).
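Putting the pieces together, a compact sketch of what such an orchestrator could look like (the log path, the prompt wording, and the use of aws s3 sync's --dryrun flag for the preview are illustrative assumptions):

```bash
#!/usr/bin/env bash
set -euo pipefail

SNAPSHOT_ROOT="$HOME/.backup-staging"      # staging area from the sketch above
BUCKET="s3://my-backup-bucket/snapshots"   # placeholder bucket and prefix
LOG_FILE="$HOME/backup-logs/$(date +%F_%H%M%S).log"
mkdir -p "$(dirname "$LOG_FILE")"

log() {
    # Timestamped, mirrored to console and log file for the audit trail.
    printf '%s %s\n' "$(date '+%F %T')" "$*" | tee -a "$LOG_FILE"
}

# Stage 1: preview the cloud sync and ask for confirmation.
log "Dry run: previewing changes to be synced"
aws s3 sync "$SNAPSHOT_ROOT" "$BUCKET" --delete --dryrun | tee -a "$LOG_FILE"

read -r -p "Proceed with upload? [y/N] " answer
if [[ "$answer" != [yY] ]]; then
    log "Aborted by user; no data was moved or deleted."
    exit 1
fi

# Stage 2: perform the real sync.
log "Syncing snapshot to S3"
aws s3 sync "$SNAPSHOT_ROOT" "$BUCKET" --delete | tee -a "$LOG_FILE"
log "Backup complete"
```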