Google Cloud storage options

Tuesday 9 July 2024

Most applications need to store data, for example media to be streamed or sensor data from devices.
Different applications and workloads require different storage and database solutions.

Google Cloud has storage options for different data types:

structured
unstructured
transactional
relational

Google Cloud has five core storage products:

Cloud Storage (like AWS S3)
Cloud SQL
Spanner
Firestore
Bigtable

(1) Cloud Storage

Object Storage

Let's first define Object Storage.

Object storage is a computer data storage architecture that manages data as “objects” rather than as:

a file and folder hierarchy (file storage), or
chunks of a disk (block storage)

These objects are stored in a packaged format which contains:

binary form of the actual data itself
relevant associated meta-data (such as date created, author, resource type, and permissions)
globally unique identifier. These unique keys are in the form of URLs, which means object storage interacts well with web technologies.
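The three parts of an object listed above can be pictured with a small Python sketch. This is only a toy model, not the Cloud Storage API: the class name, the URL, and the metadata values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class StoredObject:
    """Toy model of one object: unique key, payload, and metadata."""
    url: str                                       # globally unique identifier, in URL form
    data: bytes                                    # binary form of the actual data
    metadata: dict = field(default_factory=dict)   # descriptive key-value pairs


# Hypothetical URL and metadata, used only for illustration.
photo = StoredObject(
    url="https://storage.example.com/my-bucket/photos/cat.png",
    data=b"\x89PNG...",
    metadata={
        "created": datetime(2024, 7, 9, tzinfo=timezone.utc).isoformat(),
        "author": "alice",
        "resource_type": "image/png",
        "permissions": "private",
    },
)

# Objects live in a flat namespace: lookup is by the full key,
# not by walking a folder hierarchy.
store = {photo.url: photo}
print(store[photo.url].metadata["author"])
```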

Data commonly stored as objects includes:

video
pictures
audio recordings

Cloud Storage:

Service that offers developers and IT organizations durable and highly available object storage
Google’s object storage product
Allows customers to store any amount of data, and to retrieve it as often as needed
Fully managed scalable service

Cloud Storage Uses

Cloud Storage has a wide variety of uses. A few examples include:

serving website content
storing data for archival and disaster recovery
distributing large data objects to end users via direct download

Its primary use is whenever binary large-object storage (also known as a “BLOB”) is needed for:

online content such as videos and photos
backup and archived data
storage of intermediate results in processing workflows

Buckets

Cloud Storage files are organized into buckets.

A bucket needs:

globally unique name
specific geographic location for where it should be stored

An ideal location for a bucket is one that minimizes latency. For example, if most of our users are in Europe, we probably want to pick a European location: either a specific Google Cloud region in Europe or the EU multi-region.

The storage objects offered by Cloud Storage are immutable, which means that we do not edit them, but instead a new version is created with every change made. Administrators have the option to either allow each new version to completely overwrite the older one, or to keep track of each change made to a particular object by enabling “versioning” within a bucket.

With object versioning:

Cloud Storage will keep a detailed history of modifications (overwrites or deletes) of all objects contained in that bucket
We can list the archived versions of an object, restore an object to an older state, or permanently delete a version of an object, as needed

Without object versioning:

by default new versions will always overwrite older versions
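The difference between the two modes can be sketched with a toy in-memory bucket. This is an illustration of the behaviour described above under simplifying assumptions, not the Cloud Storage API; the `Bucket` class and its methods are hypothetical.

```python
class Bucket:
    """Toy in-memory bucket; it only mimics overwrite vs. versioning."""

    def __init__(self, versioning=False):
        self.versioning = versioning
        self._versions = {}   # object name -> list of payloads, oldest first

    def upload(self, name, data):
        history = self._versions.setdefault(name, [])
        if self.versioning:
            history.append(data)   # older versions are kept
        else:
            history[:] = [data]    # the new version overwrites the old one

    def live(self, name):
        """Return the current (most recent) version."""
        return self._versions[name][-1]

    def versions(self, name):
        """List every stored version, oldest first."""
        return list(self._versions[name])

    def restore(self, name, index):
        """Restore an older state by uploading that version again."""
        self.upload(name, self._versions[name][index])


plain = Bucket(versioning=False)
plain.upload("report.txt", b"v1")
plain.upload("report.txt", b"v2")
print(plain.versions("report.txt"))       # [b'v2']

versioned = Bucket(versioning=True)
versioned.upload("report.txt", b"v1")
versioned.upload("report.txt", b"v2")
versioned.restore("report.txt", 0)        # bring back the first version
print(versioned.versions("report.txt"))   # [b'v1', b'v2', b'v1']
```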

Access Control

In many cases, personally identifiable information may be contained in data objects, so controlling access to stored data is essential to ensuring security and privacy are maintained. Using IAM roles and, where needed, access control lists (ACLs), organizations can conform to security best practices, which require each user to have access and permissions to only the resources they need to do their jobs, and no more than that.

There are a couple of options to control user access to objects and buckets:

For most purposes, IAM is sufficient. Roles are inherited from project to bucket to object.
If we need finer control, we can create access control lists. Each access control list consists of two pieces of information:

scope, which defines who can access and perform an action. This can be a specific user or group of users
permission, which defines what actions can be performed, like read or write
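The scope and permission pair can be modelled in a few lines of Python. This is a sketch only: the `AclEntry` class, the `is_allowed` helper, and the user names are hypothetical, and treating the permissions as cumulative (OWNER includes WRITER includes READER) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    scope: str        # who: a specific user, a group, or a special scope such as allUsers
    permission: str   # what: "READER", "WRITER", or "OWNER"


# Assumption for this sketch: permissions are cumulative.
RANK = {"READER": 1, "WRITER": 2, "OWNER": 3}


def is_allowed(acl, user, needed):
    """Return True if any entry grants `user` at least the `needed` permission."""
    return any(
        entry.scope in (user, "allUsers") and RANK[entry.permission] >= RANK[needed]
        for entry in acl
    )


acl = [
    AclEntry(scope="alice@example.com", permission="OWNER"),
    AclEntry(scope="allUsers", permission="READER"),   # publicly readable
]

print(is_allowed(acl, "bob@example.com", "READER"))    # True
print(is_allowed(acl, "bob@example.com", "WRITER"))    # False
```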

Because storing and retrieving large amounts of object data can quickly become expensive, Cloud Storage also offers lifecycle management policies.

For example, we could tell Cloud Storage to delete objects older than 365 days, to delete objects created before January 1, 2013, or to keep only the 3 most recent versions of each object in a bucket that has versioning enabled.
Having this control ensures that we’re not paying for more than we actually need.
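The three example policies could be written as a lifecycle configuration along these lines. This is a sketch: the field names follow the Cloud Storage lifecycle configuration format as I understand it, and the exact top-level shape should be checked against the current documentation before use.

```python
import json

# Sketch of a lifecycle configuration for the three example policies.
lifecycle = {
    "lifecycle": {
        "rule": [
            # Delete objects older than 365 days.
            {"action": {"type": "Delete"}, "condition": {"age": 365}},
            # Delete objects created before January 1, 2013.
            {"action": {"type": "Delete"}, "condition": {"createdBefore": "2013-01-01"}},
            # Keep only the 3 most recent versions: delete a noncurrent
            # version once at least 3 newer versions exist.
            {
                "action": {"type": "Delete"},
                "condition": {"numNewerVersions": 3, "isLive": False},
            },
        ]
    }
}

config_text = json.dumps(lifecycle, indent=2)
print(config_text)
```

Saved to a file (for example lifecycle.json, a name chosen here only for illustration), a configuration like this can be applied to a bucket with the `--lifecycle-file` option of `gcloud storage buckets update`.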

Storage classes and data transfer

There are four primary storage classes in Cloud Storage:

Standard storage

considered best for frequently accessed or hot data
great for data that's stored for only brief periods of time

Nearline storage

Best for storing infrequently accessed data, like reading or modifying data on average once a month or less
Examples may include data backups, long term multimedia content, or data archiving.

Coldline storage

A low cost option for storing infrequently accessed data.
However, compared to Nearline storage, Coldline storage is meant for data that is read or modified at most once every 90 days.

Archive storage

The lowest cost option used ideally for data archiving, online backup and disaster recovery
It's the best choice for data that we plan to access less than once a year, because it has higher costs for data access and operations and a 365-day minimum storage duration.
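As a rough rule of thumb, the access frequencies above can be turned into a small helper. This is a sketch, not an official sizing tool: the function name is hypothetical, and treating "once a month" as 30 days is an assumption.

```python
def suggest_storage_class(days_between_accesses):
    """Suggest a storage class from the expected number of days between accesses."""
    if days_between_accesses > 365:
        return "ARCHIVE"     # accessed less than once a year
    if days_between_accesses >= 90:
        return "COLDLINE"    # accessed at most once every 90 days
    if days_between_accesses >= 30:
        return "NEARLINE"    # accessed about once a month or less
    return "STANDARD"        # frequently accessed ("hot") data


for days in (1, 45, 120, 400):
    print(days, suggest_storage_class(days))
```

The helper looks only at access frequency. A real decision would also weigh retrieval costs and the minimum storage duration of each class.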

Characteristics that apply across all of these storage classes:

unlimited storage
no minimum object size requirement
worldwide accessibility and locations
low latency and high durability
a uniform experience which extends to security tools and APIs
geo-redundancy if data is stored in a multi-region or dual-region. This means placing physical servers in geographically diverse data centers to protect against catastrophic events and natural disasters, and load-balancing traffic for optimal performance.

Autoclass

Cloud Storage also provides a feature called Autoclass, which automatically transitions objects to appropriate storage classes based on each object's access pattern. The feature:

moves data that is not accessed to colder storage classes to reduce storage costs
moves data that is accessed to Standard storage to optimize future accesses

Autoclass simplifies and automates cost saving for our Cloud Storage data.

Cloud Storage has no minimum fee because we pay only for what we use. Prior provisioning of capacity isn't necessary.

Data Encryption

Cloud Storage always encrypts data on the server side before it's written to disk, at no additional charge. Data traveling between a customer's device and Google is encrypted by default using HTTPS/TLS (Transport Layer Security).

Data Transfer into Google Cloud Storage

Regardless of which storage class we choose, there are several ways to bring data into Cloud Storage:

Online Transfer

by using gcloud storage, which is the Cloud Storage command from the Cloud SDK
by using the drag-and-drop option in the Cloud console, if accessed through the Google Chrome web browser

Storage Transfer Service

Enables us to import large amounts of online data into Cloud Storage quickly and cost-effectively, which helps if we have to upload terabytes or even petabytes of data
Lets us schedule and manage batch transfers to Cloud Storage from:

another cloud provider
a different Cloud Storage region
an HTTPS endpoint

Transfer Appliance

A rackable, high capacity storage server that we lease from Google Cloud
We connect it to our network, load it with data, and then ship it to an upload facility where the data is uploaded to Cloud Storage
We can transfer up to a petabyte of data on a single appliance

Data can also be moved in internally from other Google Cloud services, because Cloud Storage is tightly integrated with other Google Cloud products and services. For example, we can:

import and export tables to and from both BigQuery and Cloud SQL
store App Engine logs, files for backups, and objects used by App Engine applications, like images
store instance startup scripts, Compute Engine images, and objects used by Compute Engine applications

We should consider using Cloud Storage if we need to store immutable blobs larger than 10 megabytes, such as large images or movies. This storage service provides petabytes of capacity with a maximum unit size of 5 terabytes per object.

Provisioning Cloud Storage Bucket

We can use Cloud Shell, for example: in the Google Cloud console, click Activate Cloud Shell, then execute the following commands in it.

Create environment variables containing the location and bucket name:

$ export LOCATION=EU

$ export BUCKET_NAME=my-unique-bucket-name

or we can use the project ID as it is globally unique:

$ export BUCKET_NAME=$DEVSHELL_PROJECT_ID

To create a bucket with CLI:

$ gcloud storage buckets create -l $LOCATION gs://$BUCKET_NAME

We might be prompted to authorize execution of this command.

To download an item from a bucket to the local host:

$ gcloud storage cp gs://cloud-training/gcpfci/my-excellent-blog.png my-excellent-blog.png

To upload a file from a local host to the bucket:

$ gcloud storage cp my-excellent-blog.png gs://$BUCKET_NAME/my-excellent-blog.png

To modify the Access Control List of the object we just created so that it's readable by everyone:

$ gsutil acl ch -u allUsers:R gs://$BUCKET_NAME/my-excellent-blog.png

We can then check in the Google Cloud console that the bucket exists and contains the image.

(2) Cloud SQL

It offers fully managed relational databases as a service, including:

MySQL
PostgreSQL
SQL Server

It’s designed to hand off mundane, but necessary and often time-consuming, tasks to Google, like

applying patches and updates
managing backups
configuring replication

Cloud SQL:

Doesn't require any software installation or maintenance
Can scale up to 128 processor cores, 864 GB of RAM, and 64 TB of storage.
Supports automatic replication scenarios, such as from:

Cloud SQL primary instance
External primary instance
External MySQL instances

Supports managed backups, so backed-up data is securely stored and accessible if a restore is required. The cost of an instance covers seven backups.
Encrypts customer data when on Google’s internal networks and when stored in database tables, temporary files, and backups
Includes a network firewall, which controls network access to each database instance

Cloud SQL instances are accessible by other Google Cloud services, and even external services.

Cloud SQL can be used with App Engine using standard drivers like Connector/J for Java or MySQLdb for Python.
Compute Engine instances can be authorized to access Cloud SQL instances, and we can configure the Cloud SQL instance to be in the same zone as our virtual machine.
Cloud SQL also supports other applications and tools, like:

SQL Workbench
Toad
other external applications using standard MySQL drivers

Provisioning Cloud SQL Instance

In the Google Cloud console, go to SQL >> Create Instance and then choose values for the following properties:

Database engine:

MySQL
PostgreSQL
SQL Server

Instance ID - arbitrary string e.g. blog-db
Root user password: arbitrary string (There's no need to obscure the password because we use mechanisms to connect that aren't open access to everyone)
Choose a Cloud SQL edition:

Edition type:

Enterprise
Enterprise Plus

Choose edition preset:

Sandbox
Development
Production

Choose region - This should be the same region and zone into which we launched the Compute Engine VM instance. The best performance is achieved by placing the client and the database close to each other.
Choose zonal availability

Single zone - In case of outage, no failover. Not recommended for production.
Multiple zones (Highly available) - Automatic failover to another zone within your selected region. Recommended for production instances. Increases cost.

Select Primary zone

Once the DB instance is created, it has a root user and default networking in place.

Now we can:

see its Public IP address (e.g. 35.204.71.237)
Add User Account

username
password

set Connections

Networking >> Add a Network

Choose between Private IP connection and a Public IP connection
set Name
Network: <external_IP_of_VM_Instance>/32 (If chosen Public IP connection then use instance's external IP address)
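The /32 suffix means the authorized network contains exactly one address, the VM's external IP. A short sketch with Python's standard `ipaddress` module shows this; the helper name is hypothetical, and the address is a placeholder from the documentation range rather than a real VM.

```python
import ipaddress


def single_host_network(external_ip):
    """Return the CIDR block that matches exactly one IPv4 address."""
    address = ipaddress.IPv4Address(external_ip)   # raises ValueError if malformed
    return str(ipaddress.ip_network(f"{address}/32"))


# Placeholder address (documentation range), standing in for the VM's external IP.
network = single_host_network("203.0.113.10")
print(network)                                        # 203.0.113.10/32
print(ipaddress.ip_network(network).num_addresses)    # 1
```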

(3) Spanner

Spanner:

Fully managed relational database service that scales horizontally, is strongly consistent, and speaks SQL
Service that powers Google’s $80 billion business (Google’s own mission-critical applications and services)
Especially suited for applications that require:

SQL relational database management system with joins and secondary indexes
built-in high availability
strong global consistency
high numbers of input and output operations per second (tens of thousands of reads and writes per second or more)

The horizontal scaling approach, sometimes referred to as "scaling out," entails adding more machines to further distribute the load of the database and increase overall storage and/or processing power. [A Guide To Horizontal Vs Vertical Scaling | MongoDB]

We should consider using Cloud SQL or Spanner if we need full SQL support for an online transaction processing system.

Cloud SQL provides up to 64 terabytes, depending on machine type, and Spanner provides petabytes.

Cloud SQL is best for web frameworks and existing applications, like storing user credentials and customer orders. If Cloud SQL doesn’t fit our requirements because we need horizontal scalability, not just through read replicas, we should consider using Spanner.

(4) Firestore

Firestore is a flexible, horizontally scalable, NoSQL cloud database for mobile, web, and server development.

With Firestore, data is stored in documents and then organized into collections. Documents can contain complex nested objects in addition to subcollections. Each document contains a set of key-value pairs. For example, a document to represent a user has the keys for the firstname and lastname with the associated values.

Firestore’s NoSQL queries can then be used to retrieve:

individual, specific documents or
all the documents in a collection that match our query parameters

Queries can include multiple, chained filters and combine filtering and sorting options. They're also indexed by default, so query performance is proportional to the size of the result set, not the dataset.
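The document and collection model, and a query with chained filters plus sorting, can be imitated with plain Python dictionaries. This is a sketch of the idea only, not the Firestore client library: the `query` helper and the sample data are hypothetical, and unlike Firestore it scans every document instead of using indexes.

```python
import operator

# A "collection" of user "documents", each a set of key-value pairs.
users = {
    "alice": {"firstname": "Alice", "lastname": "Ng", "age": 34, "city": "Paris"},
    "bob": {"firstname": "Bob", "lastname": "Roy", "age": 27, "city": "Paris"},
    "carol": {"firstname": "Carol", "lastname": "Lee", "age": 41, "city": "Berlin"},
    "dave": {"firstname": "Dave", "lastname": "Kim", "age": 19, "city": "Paris"},
}

OPS = {
    "==": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def query(collection, filters, order_by=None):
    """Return the IDs of documents matching every (field, op, value) filter."""
    matches = [
        (doc_id, doc)
        for doc_id, doc in collection.items()
        if all(f in doc and OPS[op](doc[f], value) for f, op, value in filters)
    ]
    if order_by is not None:
        # Documents without the sort field are left out of the result.
        matches = [item for item in matches if order_by in item[1]]
        matches.sort(key=lambda item: item[1][order_by])
    return [doc_id for doc_id, _ in matches]


# Chained filters combined with sorting.
print(query(users, [("city", "==", "Paris"), ("age", ">=", 21)], order_by="age"))
# ['bob', 'alice']
```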

Firestore uses data synchronization to update data on any connected device. However, it's also designed to make simple, one-time fetch queries efficiently. It caches data that an app is actively using, so the app can write, read, listen to, and query data even if the device is offline. When the device comes back online, Firestore synchronizes any local changes back to Firestore.
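The offline behaviour can be pictured as a local cache plus a queue of pending writes. The sketch below is a toy model under simplifying assumptions (a single client, no conflict handling), not how the Firestore SDK is implemented; the class and method names are hypothetical.

```python
class OfflineFirstClient:
    """Toy client: reads come from a local cache, offline writes are queued."""

    def __init__(self, server):
        self.server = server          # stands in for the remote database
        self.cache = dict(server)     # local copy of the data the app is using
        self.pending = []             # writes made while offline
        self.online = True

    def read(self, key):
        return self.cache[key]        # works whether online or offline

    def write(self, key, value):
        self.cache[key] = value
        if self.online:
            self.server[key] = value
        else:
            self.pending.append((key, value))

    def go_offline(self):
        self.online = False

    def go_online(self):
        self.online = True
        for key, value in self.pending:   # replay local changes in order
            self.server[key] = value
        self.pending.clear()


server = {"score": 1}
client = OfflineFirstClient(server)

client.go_offline()
client.write("score", 2)
print(client.read("score"), server["score"])   # 2 1

client.go_online()
print(client.read("score"), server["score"])   # 2 2
```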

Firestore leverages Google Cloud’s powerful infrastructure:

automatic multi-region data replication
strong consistency guarantees
atomic batch operations
real transaction support

We should consider Firestore if we need massive scaling and predictability together with real time query results and offline query support. This storage service provides terabytes of capacity with a maximum unit size of 1 megabyte per entity. Firestore is best for storing, syncing, and querying data for mobile and web apps.

(5) Bigtable

Bigtable:

Google's NoSQL big data database service
The same database that powers many core Google services, including Search, Analytics, Maps, and Gmail
Designed to handle massive workloads at consistent low latency and high throughput, so it's a great choice for both operational and analytical applications, including Internet of Things, user analytics, and financial data analysis.

When deciding which storage option is best, we should choose Bigtable if:

We work with more than 1 TB of semi-structured or structured data
Data is fast with high throughput, or it’s rapidly changing
We work with NoSQL data. This usually means transactions where strong relational semantics are not required
Data is a time-series or has natural semantic ordering
We work with big data, running asynchronous batch or synchronous real-time processing on the data
We are running machine learning algorithms on the data
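To see why time-series data with a natural ordering suits Bigtable, recall that it keeps rows sorted by row key, so a range of adjacent keys can be read in one scan. The sketch below imitates that with a sorted Python list. The key layout (sensor ID, then a zero-padded timestamp) and the helper names are assumptions made for illustration, not a recommended schema.

```python
import bisect


def row_key(sensor_id, epoch_seconds):
    """Build a key that sorts by sensor first, then by time."""
    return f"{sensor_id}#{epoch_seconds:010d}"


# Hypothetical readings; the list is kept sorted, as rows are in Bigtable.
rows = sorted(
    row_key(sensor, ts)
    for sensor, ts in [
        ("sensor-a", 1720483200),
        ("sensor-b", 1720483200),
        ("sensor-a", 1720486800),
        ("sensor-a", 1720490400),
    ]
)


def scan(sorted_keys, start_key, end_key):
    """Return keys in [start_key, end_key), found by binary search."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, end_key)
    return sorted_keys[lo:hi]


# All sensor-a readings from the first timestamp up to (not including) the last.
print(scan(rows, row_key("sensor-a", 1720483200), row_key("sensor-a", 1720490400)))
# ['sensor-a#1720483200', 'sensor-a#1720486800']
```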

Bigtable can interact with other Google Cloud services and third-party clients.

Using APIs, data can be read from and written to Bigtable through a data service layer like:

Managed VMs
HBase REST Server
Java Server using the HBase client

Typically this is used to serve data to applications, dashboards, and data services.

Data can also be streamed in through a variety of popular stream processing frameworks like:

Dataflow Streaming
Spark Streaming
Storm

And if streaming is not an option, data can also be read from and written to Bigtable through batch processes like:

Hadoop MapReduce
Dataflow
Spark

Often, summarized or newly calculated data is written back to Bigtable or to a downstream database.

We should consider using Bigtable if we need to store a large number of structured objects. Bigtable doesn’t support SQL queries, nor does it support multi-row transactions. This storage service provides petabytes of capacity with a maximum unit size of 10 megabytes per cell and 100 megabytes per row. Bigtable is best for analytical data with heavy read and write events, like AdTech, financial, or IoT data.
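The maximum unit sizes quoted in these notes can be collected into one lookup, which makes it easy to check where a single item of a given size would fit. This is a sketch: the helper name is hypothetical, and interpreting megabytes and terabytes as binary units (powers of 1024) is an assumption.

```python
MB = 1024 ** 2
TB = 1024 ** 4

# Maximum size of a single stored unit, as quoted in these notes.
MAX_UNIT_BYTES = {
    "Cloud Storage": 5 * TB,   # per object
    "Firestore": 1 * MB,       # per entity
    "Bigtable": 10 * MB,       # per cell
}


def products_that_fit(size_bytes):
    """Return the products whose single-unit limit can hold `size_bytes`."""
    return sorted(
        name for name, limit in MAX_UNIT_BYTES.items() if size_bytes <= limit
    )


print(products_that_fit(500 * 1024))   # ['Bigtable', 'Cloud Storage', 'Firestore']
print(products_that_fit(5 * MB))       # ['Bigtable', 'Cloud Storage']
print(products_that_fit(50 * MB))      # ['Cloud Storage']
```

Size is only one criterion. The data model (objects, relational tables, documents, wide rows) and the query needs described in each section usually decide the choice first.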

---

BigQuery hasn’t been mentioned in this section because it sits on the edge between data storage and data processing. The usual reason to store data in BigQuery is so we can use its big data analysis and interactive querying capabilities, but it’s not purely a data storage product.
