Public Cloud Storage and Database Review (May 2020 update)

The objective of this post is to analyze the Database and Storage Services, including in-memory cache, offered by the four main public cloud providers: AWS, GCP, Azure and Alibaba. In addition, the post identifies the use cases for each kind of Database and Storage, taking into account factors like latency, consistency, storage capacity and API access.

Database and Storage Services

Database and Storage Services are among the key services offered by the Public Cloud to save and access data. Unfortunately, the CAP theorem states that it is impossible for a distributed data store (as offered by the Cloud Vendors) to simultaneously provide more than two of the following three capabilities:

  • Consistency
  • Availability
  • Partition tolerance

So, you are forced to choose the Database and Storage Services that best fit the needs of your use cases. In fact, your main decision is between Consistency (or Transactionality) and Availability, because Partition Tolerance is a must in a cloud environment. There are other considerations, like Latency, Storage Capacity, SLA, Multi Region support and Language access, that we will review in the following chapters.
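The Consistency versus Availability choice can be illustrated with a toy model. The sketch below is only an illustration, not how any provider implements replication: the class name `TwoReplicaStore`, the "CP" and "AP" mode labels and the `partitioned` flag are all assumptions made for the example.

```python
class TwoReplicaStore:
    """Toy data store with two replicas, A and B (hypothetical, for illustration)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replica_a = {}
        self.replica_b = {}
        self.partitioned = False  # True when A and B cannot reach each other

    def write(self, key, value):
        """Write through replica A, the one the client can reach."""
        if not self.partitioned:
            # Healthy network: both replicas receive the write.
            self.replica_a[key] = value
            self.replica_b[key] = value
            return
        if self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            raise RuntimeError("write rejected: replicas are partitioned")
        # Availability first: accept the write locally, replica B becomes stale.
        self.replica_a[key] = value

    def read_from_b(self, key):
        """Read from replica B, as a client on the other side of the partition would."""
        return self.replica_b.get(key)


cp_store = TwoReplicaStore("CP")
ap_store = TwoReplicaStore("AP")
for store in (cp_store, ap_store):
    store.write("balance", 100)
    store.partitioned = True

ap_store.write("balance", 50)  # accepted, but replica B still holds 100
```

In this model the "AP" store keeps answering during the partition but can return a stale value, while the "CP" store stays consistent by rejecting the write.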

In general, AWS, GCP, Azure and Alibaba structure their Database and Storage Services as follows:

  • Databases
    • Relational
    • Non Relational
  • Storage
    • Object
    • Block
    • File
  • In Memory Cache

Relational Databases

A relational database is based on the relational model of data (as a collection of relations).

The Relational Model is based on the idea that each table will include a primary key or identifier. Other tables use that identifier to provide “relational” data links and results.

Most relational databases use the SQL data definition and query language.

The Cloud Providers offer two kinds of Managed Relational Databases:

  • Managed Market Relational Databases like MySQL, PostgreSQL, SQL Server and MariaDB, with limited scalability and size. The Cloud Providers offer different levels of managed database services that make it easy to set up, maintain, manage and administer the market databases in the cloud.
  • Proprietary Cloud Provider Relational Databases designed to scale on the Cloud Provider infrastructure.

The Market Relational Database applies when you have a legacy application based on a market database and you don’t want to modify the application code, or when you want to avoid lock-in with the Cloud Provider’s proprietary option.

In both situations, you have to take into account the limited scalability offered by the cloud providers when you use a Market Database.

The Proprietary Cloud Provider Relational Database, on the other hand, offers better horizontal scalability at a lower price, plus other capabilities like Multi Region replication.

The main trick used by the Proprietary Cloud Provider Relational Database (and some market solutions) to support better horizontal scalability is replication: a semi-synchronous replica provides a Failover instance, and asynchronous replication feeds multiple Read Only instances. The scalability is therefore focused on read queries. If your use case requires many updates or inserts, it may be reasonable to change the data model.
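Because only reads scale out, the application (or a proxy in front of the database) has to send writes to the primary and spread reads over the replicas. A minimal sketch of that routing follows. The class name, the instance names and the "starts with SELECT" test are assumptions for illustration; a real setup would normally rely on the provider's reader and writer endpoints.

```python
import itertools


class ReplicatedDatabase:
    """Routes statements to a primary or to read replicas (hypothetical helper)."""

    def __init__(self, primary, read_replicas):
        self.primary = primary
        self.read_replicas = list(read_replicas)
        # Round-robin iterator over the Read Only instances.
        self._next_replica = itertools.cycle(self.read_replicas)

    def route(self, sql):
        """Return the name of the instance that should run the statement."""
        # Simplification: treat anything starting with SELECT as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.read_replicas:
            return next(self._next_replica)
        # Inserts, updates and deletes always go to the single primary.
        return self.primary


db = ReplicatedDatabase("primary", ["replica-1", "replica-2"])
targets = [
    db.route("SELECT * FROM orders"),
    db.route("SELECT * FROM users"),
    db.route("INSERT INTO orders VALUES (1)"),
    db.route("SELECT 1"),
]
```

Adding replicas increases read capacity, but every write still lands on the same primary. Since the Read Only instances are fed asynchronously, a read that follows a write closely may not see it yet.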

 

In addition, the Relational Databases managed by the cloud providers have a storage limit of 64 TB (100 TB for Azure SQL Database), so if you need to manage more data you should choose a Non Relational Database (Key-Value, for instance).

Regarding the billing model, you are charged for the following:

  • The number of nodes and/or the instance type.
  • The amount of storage that your tables and indexes use; some vendors (AWS) also charge for I/O operations.
  • The amount of network bandwidth used (mainly egress traffic).
  • Geo Replication.

Finally, there are two more options to deploy a Relational Database in the cloud:

  • Unmanaged Market Relational Databases like DB2, Oracle and others, with limited scalability and size, where the Cloud Provider has no responsibility for managing the database.
  • Third-party Managed Market Relational Databases, where a third party takes responsibility for managing the database on the Cloud Provider infrastructure and guarantees its portability across Cloud Providers (MongoDB Atlas, although it is a document database, is a clear example of this model).

In the analysis, I will focus only on the Relational Databases managed by the Cloud Provider.

Non Relational Databases

Non Relational Databases can be categorized into four types:

  • Key-Value database, or hash table, where records are stored and retrieved using a key that uniquely identifies each record and is used to find the data quickly within the database.
  • Document database, a subclass of Key-Value that stores all the information for a given object (or document) in a single instance in the database, where every stored object can be different from every other (semi-structured data). In addition, the Document database relies on the internal structure of the document to extract metadata.
  • Columnar database, which stores data tables by column rather than by row. It is optimized for fast retrieval of columns of data, typically in analytical applications that run highly complex queries over all the data.
  • Graph database, which uses graph structures with nodes, edges and properties to represent and store data and to answer semantic queries.
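The four models can be compared by writing similar data in each shape. The sketch below uses plain Python structures only; the sample users, keys and field names are invented for illustration and do not correspond to any provider's API.

```python
# Key-Value: an opaque value stored and retrieved by a unique key.
key_value = {"user:42": b'{"name": "Ana", "city": "Madrid"}'}

# Document: the whole object lives in one record, and two records
# may have different fields (semi-structured data).
documents = {
    "42": {"name": "Ana", "city": "Madrid", "orders": [{"sku": "A1", "qty": 2}]},
    "43": {"name": "Luis", "tags": ["vip"]},
}

# Columnar: one array per column, so an aggregate reads a single column
# instead of every full row.
columns = {
    "user_id": [42, 43, 44],
    "city": ["Madrid", "Lima", "Madrid"],
    "spend": [120.0, 80.0, 45.5],
}
total_spend = sum(columns["spend"])

# Graph: nodes with properties plus labelled edges between them.
nodes = {"ana": {"city": "Madrid"}, "luis": {"city": "Lima"}, "eva": {"city": "Madrid"}}
edges = [("ana", "FOLLOWS", "luis"), ("luis", "FOLLOWS", "eva")]


def neighbours(node, relation):
    """Return the nodes reachable from `node` through edges labelled `relation`."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]


ana_follows = neighbours("ana", "FOLLOWS")
```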

In general, Non Relational Databases offer better scalability, storage size and speed than Relational Databases, at the cost of reduced consistency (remember the CAP theorem).

In addition, the Cloud Providers have decided to develop custom/proprietary implementations of Non Relational Databases or, in some cases, to implement a very standard Open Source solution. They don’t offer managed solutions for third-party NoSQL Databases.

Finally, there are some common trends:

  • There is some consensus on offering a Document Database with MongoDB compatibility (with the exception of Google).
  • The Graph Database emerged in 2019. Google and Alibaba, which don’t have a custom development, offer JanusGraph deployments until a custom solution is developed.
  • Amazon and Alibaba have decided to create specialized time series databases (Amazon Timestream, for instance) for IoT events and operational applications.

Regarding the billing model, you are charged for the following:

  • The CPU required, based on one of the following options (depending on the cloud provider and database):
    • The number of nodes or instances (AWS, Alibaba & Google).
    • Read/write requests (AWS, Alibaba & Google).
    • Request Units, a combination of CPU, memory and IOPS (Azure).
    • Data scanned (Warehouse option).
  • The amount of storage that your tables use; some vendors (AWS) also charge for I/O operations.
  • The amount of network bandwidth used (mainly egress traffic).
  • Geo Replication and other additional functions like cache accelerators.

As with Relational Databases, there are two additional options to deploy a Non Relational Database in the cloud:

  • Unmanaged Market Non Relational Database.
  • Third-party Managed Market Non Relational Database, where MongoDB Atlas is a clear reference.

Object Storage

Object storage manages data as objects. Each object typically includes the data itself, a variable amount of metadata, a version and a globally unique identifier.

In general, a single object can be up to 5 TB in size.

Object storage systems allow the retention of massive amounts of unstructured data. Object storage is used for purposes such as storing media content, backups and archives, and as an integrated repository for analytics and Machine Learning.

Objects can be organized in sublevels (Buckets or Containers).

The Cloud Providers offer different storage classes, each with a specific SLA and cost, such as:

  • High Frequency Access.
    • Multi-Regional, where the objects are replicated across multiple regions to improve latency and availability.
    • Regional.
  • Low Frequency Access, where the storage cost is lower if you access the data infrequently (less than once per month, for example).
  • Lowest Frequency Access, usually historical data or backups that don’t need to be accessed more than once per year, with the cheapest storage cost.

They also offer Versioning & Life Cycle Management.
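A Life Cycle Management rule essentially maps how often an object is accessed to one of these classes. The sketch below encodes the example thresholds given above (once per month, once per year). The function name and the class labels are generic placeholders, not the names any provider uses, and real thresholds and prices vary by provider.

```python
def choose_storage_class(accesses_per_year):
    """Pick a storage class from the expected number of accesses per year.

    Thresholds follow the examples in the text; they are illustrative only.
    """
    if accesses_per_year < 0:
        raise ValueError("accesses_per_year must not be negative")
    if accesses_per_year <= 1:
        # At most once per year: cheapest storage, meant for archives and backups.
        return "lowest-frequency"
    if accesses_per_year < 12:
        # Less than once per month on average.
        return "low-frequency"
    return "high-frequency"
```

For example, a backup read once a year would map to the lowest frequency class, while an image served daily would stay in the high frequency class.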

Regarding the billing model, you are charged for the following:

  • Storage, depending on the class.
  • Operation usage (Get, Put, Create, Delete, …), also depending on the class.
  • The amount of network bandwidth used (mainly egress traffic).
  • Geo Replication and other additional functions.

Block Storage

Block Storage manages data as blocks within sectors and tracks. It is typically used in storage-area network (SAN) environments or attached to a VM, where data is stored in volumes that are divided into blocks. Each block is assigned an arbitrary identifier by which it can be stored and retrieved, but carries no metadata providing further context.

File systems and databases are common uses for block storage because they require consistently high performance.
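The idea of blocks addressed only by an identifier, with no metadata, can be sketched as a fixed-size byte array. This is a toy in-memory model under assumed names (`BlockVolume`, `write_block`, `read_block`) and an assumed 512-byte default block size, not a real device driver or provider API.

```python
class BlockVolume:
    """Toy volume made of fixed-size blocks addressed by index (hypothetical)."""

    def __init__(self, block_count, block_size=512):
        self.block_count = block_count
        self.block_size = block_size
        # The whole volume is one flat, zero-filled byte array.
        self._data = bytearray(block_count * block_size)

    def _offset(self, index):
        if not 0 <= index < self.block_count:
            raise IndexError("block index out of range")
        return index * self.block_size

    def write_block(self, index, payload):
        """Overwrite one block; shorter payloads are padded with zero bytes."""
        if len(payload) > self.block_size:
            raise ValueError("payload larger than block size")
        start = self._offset(index)
        padded = payload.ljust(self.block_size, b"\x00")
        self._data[start:start + self.block_size] = padded

    def read_block(self, index):
        """Return the raw bytes of one block; the volume knows nothing about their meaning."""
        start = self._offset(index)
        return bytes(self._data[start:start + self.block_size])
```

The volume stores no file names, owners or types. A file system or database layered on top is what gives the blocks meaning.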

Regarding the billing model, you are charged for the following:

  • Volume Type (SSD, HDD, Ultra SSD…).
  • Storage.
  • Snapshots.

File Storage

File Storage manages data as a file hierarchy, offered as a fully managed Network Attached Storage (NAS). File storage provides a centralized, highly accessible location for files, and generally comes at a lower cost than block storage. File storage uses metadata and directories to organize files, which makes it a convenient option for an organization looking to simply store large amounts of data.

In general, File Storage supports two protocols:

  • SMB version 3.0 for Windows.
  • NFS v3-4 for Linux and others.

However, the managed file services of Google (Filestore) and Amazon (EFS) implement only the NFS protocol, so Windows machines can’t use their native protocol with them.

Regarding the billing model, you are charged for the following:

  • Storage (per Class if any).
  • Data Transfer out.

In Memory Storage

In Memory Storage is a storage system that relies primarily on main memory for data storage. Its main use is as a very fast cache for read-only data.

The Cloud Providers deploy In Memory Storage based on Redis and Memcached as a fully managed in-memory data store service.

Redis is the main bet for all the Cloud Providers thanks to its advanced capabilities.
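The typical way to use such a cache is the cache-aside pattern: look in memory first and fall back to the database only on a miss. The sketch below uses a plain dictionary with expiry times in place of a managed Redis or Memcached service. `TTLCache`, `get_with_cache` and the `load_from_database` callback are hypothetical names introduced for the example.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after `ttl_seconds` (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            # Entry is too old: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl_seconds)


def get_with_cache(cache, key, load_from_database):
    """Cache-aside read: serve from memory, load and store on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_database(key)  # slow path, hits the database
        cache.set(key, value)
    return value
```

The expiry time bounds how stale a cached value can get, which is why the pattern suits data that is read often and changes rarely.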

Regarding the billing model, you are charged for the following:

  • Service Tier (Cache node Type and nodes).
  • Storage.
  • Data Transfer out or inter AZ.
  • Region of the service.

Hybrid Cloud Storage

Additionally, cloud providers are beginning to provide hybrid storage solutions (internal or third-party) for use cases like moving tape backups to the cloud, reducing on-premises storage with cloud-backed file shares, providing low latency access to data for on-premises applications, as well as various migration, archiving, processing, and disaster recovery use cases.

Analytics

The Analytics platform (Data Computing, Data Visualization, Data Search and Analytics, and Data Development) could also fall under the Storage and Database services, but because it is a topic in its own right and covers more services than pure storage, it will be treated in a different chapter.

Database and Storage Services Use Cases & Recommendations

Pros, Cons and Use Cases

In Memory

  • Pros
    • Fast Low Latency Access
  • Cons
    • Limited Size
    • Can be lost in some situations
  • Use Cases
    • Caching
    • Gaming
    • Stream Processing
    • Chat & messaging
    • Real Time analytics
    • Session store

Object Storage

  • Pros
    • Multiregional
    • Highly Scalable, Durable
    • Versioning & LCM
  • Cons
    • Only for unstructured data
  • Use Cases
    • Image & Media Serving
    • Integrated repository for analytics and ML
    • Backup & DR
    • Archive
    • Hybrid cloud storage

Block Storage

  • Pros
    • VM Storage
    • Sharing between VMs
    • Fast Snapshots
    • Resizing
  • Cons
    • Storage is tied (insert/update) to one server at a time
  • Use Cases
    • Block storage for compute or container engines
    • Legacy File/Database migration
    • Business continuity

File Storage

  • Pros
    • Can be accessed using regular file share methods, such as a mapped drive or file I/O APIs and commands
    • Cheaper than Block Storage (pay per use)
  • Cons
    • Limited storage
    • Low consistency
  • Use Cases
    • Legacy File migration
    • Shared storage
    • Web content
    • Log files
    • Application Configuration files

Relational Database (Market Managed Database)

  • Pros
    • ACID
    • Multiple Read Replicas
    • Simple & Fully Managed
  • Cons
    • Limitations on Horizontal Scalability and storage
    • Strict Schema
    • Single region
  • Use Cases
    • Legacy (MySQL, PostgreSQL, ...)
    • Ecommerce or CMS
    • Finance

Relational Database (Cloud Provider Database)

  • Pros
    • ACID
    • Horizontally Scalable
    • Strong consistency
    • SQL support
    • High availability
    • Multiregional
  • Cons
    • Works with structured data
    • Limited Storage
  • Use Cases
    • RDBMS with horizontal scalability
    • High Transactionality apps

Key-Value Database

  • Pros
    • Low Latency
    • Massively scalable NoSQL (key/value)
    • Hundreds of Petabytes
    • Easy integration with open source big data tools
  • Cons
    • Works only with structured data
    • Single region
    • No SQL access
    • ACID support only at row level
  • Use Cases
    • Ad Tech, Fintech, and IoT
    • Microservices Data
    • Storage engine for machine learning applications

Document Database

  • Pros
    • Horizontally Scalable
    • Eventual consistency
    • Low latency
    • Works with Semi-Structured data
    • SQL access
  • Cons
    • Support of complex joins
    • Single Region
    • Up to n Terabytes
  • Use Cases
    • Document Oriented / Hierarchical with horizontal scalability
    • Product Catalog
    • User Profile
    • Game
    • Content Management

Columnar Database

  • Pros
    • Massively scalable EDW
    • OLAP workload and up to petabyte scale
    • Real-time Analytics
    • Automatic High Availability
    • Standard SQL
  • Cons
    • Not transactional: latency, and not good for updates
  • Use Cases
    • Enterprise Data Warehouse/Big data/BI

Graph Database

  • Pros
    • Better performance for querying related data (relationships)
  • Cons
    • Not efficient at processing high volumes of transactions
  • Use Cases
    • Social networks
    • Fraud analysis
    • Physical networks
    • Recommendation engines
    • Life sciences