Migrating from DSE to Astra DB: Lessons Learned

2024-12-15
Tags: Cassandra, Migration, Cloud, Spark, Astra DB

When we set out to migrate our trading platform from on-premise DSE clusters to DataStax Astra DB on Azure, we knew it would be a complex undertaking. The platform serves real-time financial transactions across multiple data centres with strict latency and consistency requirements. Here's how we designed and executed a zero-downtime migration.

The Challenge

Our on-premise DSE clusters were approaching end-of-life, tightly coupled with Solr-based search indexes and legacy Spark jobs. The migration needed to be:

  • Zero-downtime — the platform serves real-time financial transactions
  • Data-consistent — financial data cannot tolerate inconsistency
  • Reversible — we needed a rollback plan at every stage
  • Secure — PrivateLink, TLS, SSO, and BYOK encryption from day one

Architecture: Zero Downtime Migration Proxy

[Figure: Migration Proxy Architecture]

The centrepiece of our approach was a transparent proxy layer between applications and the database clusters, enabling seamless traffic routing without application changes.

Clustered Proxy for High Availability

We deployed the proxy in a clustered configuration per data centre. Each API service connected to multiple proxy instances, providing high availability. The proxy uses topology-aware configuration so the application driver receives consistent cluster information regardless of which instance it connects to.

# Proxy cluster topology (conceptual)
PROXY_CLUSTER_ADDRESSES: [node1, node2, node3]
PROXY_NODE_INDEX: 

The driver sees a consistent topology:

  • system.local → current proxy
  • system.peers → all other proxies
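To make the topology idea concrete, here is a minimal sketch of how one proxy instance could derive what it reports to a driver from the shared address list and its own index. The function and class names are hypothetical and are not the proxy's real API; the sketch only illustrates the local/peers split described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyView:
    local: str        # what this instance reports as system.local
    peers: list[str]  # what this instance reports as system.peers


def topology_view(addresses: list[str], node_index: int) -> TopologyView:
    """Derive the view a single proxy instance presents to a driver."""
    if not 0 <= node_index < len(addresses):
        raise ValueError(f"node_index {node_index} is out of range")
    peers = [addr for i, addr in enumerate(addresses) if i != node_index]
    return TopologyView(local=addresses[node_index], peers=peers)


if __name__ == "__main__":
    addresses = ["node1", "node2", "node3"]
    for index in range(len(addresses)):
        print(index, topology_view(addresses, index))
```

Whichever instance the driver contacts first, the union of local and peers is the same set of proxies, which is what keeps the driver's view of the cluster stable.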

Feature Toggle Strategy

We designed a comprehensive toggle system to control every aspect of the migration independently:

Category              Purpose
Connection routing    Switch between on-prem and cloud connection details
Credential selection  Control which credentials the proxy uses for the target
Read routing          Direct reads to origin (on-prem) or target (cloud)
Query syntax          Switch between legacy search and cloud-native query syntax
Dual writes           Enable parallel writes to both clusters from batch jobs
Reconciliation        Enable data validation against the cloud target

This granular control allowed us to progress through migration stages independently per component and per data centre, with instant rollback capability.
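One way to picture the toggle set is as an immutable record with one field per category in the table, held separately for each component and data centre. This is an illustrative sketch only: every field name, the component and data centre keys, and the rollback helper are hypothetical. It also assumes that rolling reads back to origin means switching back to the legacy query syntax.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Cluster(Enum):
    ORIGIN = "on-prem"
    TARGET = "cloud"


@dataclass(frozen=True)
class MigrationToggles:
    # One field per toggle category; all names are hypothetical.
    use_cloud_connection: bool = False
    use_cloud_credentials: bool = False
    read_from: Cluster = Cluster.ORIGIN
    cloud_native_queries: bool = False
    dual_writes: bool = False
    reconciliation: bool = False


def rollback_reads(toggles: MigrationToggles) -> MigrationToggles:
    """Send reads back to origin; assumes legacy query syntax goes with it."""
    return replace(toggles, read_from=Cluster.ORIGIN, cloud_native_queries=False)


# Toggles are held per (component, data centre), so each can move on its own.
state = {
    ("api", "dc1"): MigrationToggles(read_from=Cluster.TARGET, cloud_native_queries=True),
    ("api", "dc2"): MigrationToggles(),
}
state[("api", "dc1")] = rollback_reads(state[("api", "dc1")])
```

Because the record is frozen, a rollback produces a new value rather than mutating shared state, which makes each change easy to log and audit.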

Phased Cutover Plan

[Figure: Phased Cutover Strategy]

We executed the migration in four carefully orchestrated stages:

Stage 1 — Deploy proxy layer with target pointing to on-prem. All reads from on-prem. Validates the proxy works transparently with zero application impact.

Stage 2 — Migrate historical data to cloud. Enable dual writes. Reads still from on-prem. Reconciliation jobs validate data consistency between clusters.

Stage 3 — Cut over reads to cloud on the first data centre only. Enable cloud-native queries. Monitor for issues before proceeding.

Stage 4 — Cut over reads to cloud on the second data centre. Full migration complete.

Each stage had a documented fallback procedure that could restore service within the RTO window by toggling reads back to origin.
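The read-routing side of the four stages can be written down as data, which makes it easy to check that each step changes at most one data centre. This is a sketch under stated assumptions: "dc1" and "dc2" are placeholder names for the first and second data centre, and the helper functions are hypothetical.

```python
ORIGIN, TARGET = "origin", "target"

# Where each data centre reads from at each stage, per the plan above.
READS_BY_STAGE = {
    1: {"dc1": ORIGIN, "dc2": ORIGIN},
    2: {"dc1": ORIGIN, "dc2": ORIGIN},
    3: {"dc1": TARGET, "dc2": ORIGIN},
    4: {"dc1": TARGET, "dc2": TARGET},
}


def changed_dcs(from_stage: int, to_stage: int) -> list[str]:
    """Data centres whose read routing differs between two stages."""
    before, after = READS_BY_STAGE[from_stage], READS_BY_STAGE[to_stage]
    return [dc for dc in before if before[dc] != after[dc]]


def fallback(stage: int) -> dict[str, str]:
    """Fallback routing for a stage: every data centre reads from origin."""
    return {dc: ORIGIN for dc in READS_BY_STAGE[stage]}


if __name__ == "__main__":
    for nxt in (2, 3, 4):
        print(f"stage {nxt - 1} -> {nxt}: read routing changes in {changed_dcs(nxt - 1, nxt)}")
```

Keeping the plan as data also gives the fallback procedure a single, testable definition instead of a set of manual steps.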

Replacing Solr with Storage-Attached Indexing

[Figure: Solr to SAI Migration Patterns]

One of the biggest technical challenges was replacing 75+ Solr indexes with Storage-Attached Indexing (SAI), given cloud guardrails on index counts per database.

Our remediation patterns included:

  • Predictive search columns — combining multiple text fields into a single column with an edge n-gram analyzer for type-ahead search
  • Collection columns — using CONTAINS queries on sets/maps instead of OR predicates
  • Lookup tables — providing keys to filter on existing tables without requiring indexes
  • Composite columns — combining fields always used together into a single indexed column
  • Schema simplification — removing unnecessary partition key columns to enable pure CQL queries
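To show why a single combined column plus an edge n-gram analyzer supports type-ahead, the sketch below computes edge n-grams in plain Python. It illustrates the concept only and is not the index configuration itself; the function names, the length bounds, and the sample field values are all hypothetical.

```python
def edge_ngrams(token: str, min_len: int = 2, max_len: int = 10) -> list[str]:
    """Prefixes of a token, from min_len up to max_len characters."""
    token = token.lower()
    upper = min(len(token), max_len)
    return [token[:n] for n in range(min_len, upper + 1)]


def search_column(*fields: str) -> str:
    """Combine several text fields into the one column that gets indexed."""
    return " ".join(f.strip() for f in fields if f and f.strip())


def matches_prefix(column_value: str, typed: str) -> bool:
    """True if the typed text equals an edge n-gram of any token."""
    typed = typed.lower()
    return any(typed in edge_ngrams(tok) for tok in column_value.split())


if __name__ == "__main__":
    combined = search_column("Jane", "Smith", "ACME Holdings")
    print(edge_ngrams("Smith"))
    print(matches_prefix(combined, "hol"), matches_prefix(combined, "ith"))
```

Each token is indexed under all of its prefixes, so one index on the combined column can answer prefix lookups that would otherwise need an index per field.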

Dual Write Framework for Batch Jobs

Rather than routing batch jobs through the proxy (which would require reading from the cloud schema), we implemented dual writes directly in the processing layer:

// Dual write pattern (pseudo-code)
function processAndWrite(dataFrame, table):
    dataFrame.writeTo(onPremCluster, table)

    if featureToggle("dual_write_enabled"):
        dataFrame.writeTo(cloudCluster, table)

This kept reads and joins on the existing on-prem cluster (avoiding schema differences) while ensuring the cloud target stayed in sync.
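The same pattern as a runnable sketch, with plain lists standing in for DataFrames and the two cluster writers injected as callables. All names here are hypothetical; in a real Spark job the writers would wrap whatever write call the job already uses.

```python
from typing import Callable, Iterable, Mapping

Row = Mapping[str, object]
Writer = Callable[[str, list], None]


def process_and_write(
    rows: Iterable[Row],
    table: str,
    write_on_prem: Writer,
    write_cloud: Writer,
    dual_write_enabled: Callable[[], bool],
) -> None:
    """Always write to on-prem; also write to cloud when the toggle is on."""
    batch = list(rows)  # materialise once so both writes see identical data
    write_on_prem(table, batch)
    if dual_write_enabled():
        write_cloud(table, batch)


if __name__ == "__main__":
    on_prem_log, cloud_log = [], []
    process_and_write(
        [{"id": 1}, {"id": 2}],
        "trades",  # placeholder table name
        lambda table, batch: on_prem_log.append((table, batch)),
        lambda table, batch: cloud_log.append((table, batch)),
        lambda: True,
    )
    print(len(on_prem_log), len(cloud_log))
```

The toggle is passed as a callable so it is evaluated on every run, which means switching dual writes off takes effect without redeploying the job. If the cloud write fails after the on-prem write succeeds, the clusters diverge until reconciliation flags the gap.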

Security Architecture

The solution implemented enterprise-grade security controls:

  • Private connectivity — all traffic between on-prem and cloud via private endpoints, no public internet exposure
  • Customer-managed encryption — bring-your-own-key (BYOK) for data at rest
  • Enterprise SSO — console access via corporate identity provider
  • Privileged access — database access restricted to designated nodes via privileged access management
  • SIEM integration — audit logging enabled for compliance and monitoring

Data Reconciliation

We built automated reconciliation to validate data consistency between clusters. A nightly job compared records across all migrated tables, logged the results for audit, and raised alerts on any discrepancies.

// Reconciliation (pseudo-code)
function reconcile(table):
    onPremData = readAll(onPremCluster, table)
    cloudData  = readAll(cloudCluster, table)
    diff = compare(onPremData, cloudData)

    if diff.hasDiscrepancies():
        alert(diff.summary())
        log(diff.details())
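The compare step can be sketched as follows, assuming each table has already been read into a dictionary keyed by primary key. The Diff structure and its three categories are hypothetical, and loading whole tables into memory is for illustration only; a production job would more likely compare in chunks.

```python
from dataclasses import dataclass, field


@dataclass
class Diff:
    missing_in_cloud: list = field(default_factory=list)
    missing_on_prem: list = field(default_factory=list)
    mismatched: list = field(default_factory=list)

    def has_discrepancies(self) -> bool:
        return bool(self.missing_in_cloud or self.missing_on_prem or self.mismatched)

    def summary(self) -> str:
        return (
            f"{len(self.missing_in_cloud)} missing in cloud, "
            f"{len(self.missing_on_prem)} missing on-prem, "
            f"{len(self.mismatched)} mismatched"
        )


def compare(on_prem: dict, cloud: dict) -> Diff:
    """Classify every primary key seen in either cluster."""
    diff = Diff()
    for key in sorted(on_prem.keys() | cloud.keys()):
        if key not in cloud:
            diff.missing_in_cloud.append(key)
        elif key not in on_prem:
            diff.missing_on_prem.append(key)
        elif on_prem[key] != cloud[key]:
            diff.mismatched.append(key)
    return diff


if __name__ == "__main__":
    on_prem = {1: {"amt": 10}, 2: {"amt": 20}, 3: {"amt": 30}}
    cloud = {1: {"amt": 10}, 2: {"amt": 25}, 4: {"amt": 40}}
    result = compare(on_prem, cloud)
    if result.has_discrepancies():
        print(result.summary())
```

Separating missing rows from mismatched rows helps triage: a missing row usually points at a write that never arrived, while a mismatch points at a write that arrived with different content.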

Key Takeaways

  • Proxy layer is essential — it de-risks migration by enabling toggle-based routing with no application redeployment
  • Feature toggles at every layer — independent control of APIs, batch jobs, and query behaviour enables granular rollback
  • Phased DC-by-DC cutover — validate in one data centre before rolling to the next
  • Plan index strategy early — cloud platforms have guardrails on index counts that require upfront schema redesign
  • Dual writes over proxy for batch — when schemas differ, direct dual writes are simpler than routing through the proxy
  • Security first — private connectivity, BYOK, and SSO from day one, not bolted on later

Result

The migration completed on schedule with zero downtime, achieving a 40% reduction in TCO and earning recognition through the 2024 Westpac Technology Award for leadership impact in delivering the platform modernisation.