Developers & Practitioners
The definitive guide to databases on Google Cloud: Part 1 - Data modeling basics
The next best thing since sliced bread!
As I wake up to the smell of tea and freshly ground cardamom every morning, all I could think of is how nice it would be with some soft buttery sliced bread!
Hmm... if only it wasn't for the gluten allergy.
Anyway, the one thing that is half as old as sliced bread but twice as good as that is Database! The set of comprehensive application programs that can be leveraged to store, access, manage and update data whilst assuring structure, recovery, security, concurrency and more.
Yup exactly 52 years ago, E.F. Codd, the father of Database Management Systems (DBMS) propounded and formalized these as 12 commandments that are in fact 13 in number (starting from 0. I know! right?) that make up a DBMS. We have evolved since the 1960s when we used one database to store and secure information to these modern times where we use one-database-per-stage in the data lifecycle, in fact one database per data stage, type and structure in most cases.
In this blog we are going to discuss the business attributes, technical aspects, design questions, considerations to keep in mind while “Designing the Database Model” (and if you hang on till the end, a simple contest to ensure you have been entertained thoroughly!). And if you ask…
Is there a quick and dirty choice?
The short answer is NO. However your selection of database can be derived based on the answers to a multitude of database business requirement questions including the top 3 mandatory ones below:
1. What is the stage in the data lifecycle?
Starting from Ingestion (landing data from different sources in one place), Storage (staging in a persistent location for use later), Processing (transformational stages for analysis) to Visualization (derive and present insights from the analysis), we need to be aware of the stage in the lifecycle of data for which we are designing / discussing storage requirements.
2. What is the type of data you are bringing in?
There are broadly 3 types of data we would be dealing with that would highly influence the choice of database and storage.
a. Application: that covers transactional and event based data
b. Live Streaming (Real-time): that covers data from real time sources
c. Batch: that covers bulk, scheduled interval and event triggered data
Real-time data is immediate and constantly up to date, the integration of this type of data needs to be carried out at the time of the event. Whereas, Batch data process is scheduled at specific times and amounts.
3. Is your data Structured, Unstructured or Hybrid?
a. Structured data is modeled with rows, columns and are mostly transactional and analytical in nature
b. Unstructured could be anything like images, audio files etc. The amount of unstructured data is much larger than that of structured data, so the methods by which we store such data is more important than ever. This means that companies not taking unstructured data into account could be missing out on a lot of crucial business intelligence.
c. The semi-structured (hybrid) data are the ones with attributes defined but could vary for each record. The major differentiating factor for each kind of semi-structured data is in the way they are retrieved / accessed
Engineering or architecting, ask the right data questions!
There is always this question around the responsibilities concerning data.
If you design data architecture, manage business and technology requirements around the architecture, involve in design of data extraction, transformation, loading and provide direction to the team for methods of organizing, formats and presentation of data, then you are an Architect.
As an Engineer you create applications and develop solutions to enable data for distribution, processing, analysis and participate in one or more of those activities directly.
But in either case, you are an expert, you need to ask the right questions and need to set the right expectation as you approach the technical aspects of data. It is not always possible to get "the one" solution with these questions below but will definitely help get started and eliminate the mismatches easily right off the start.
Volume and Scalability:
- What is the size of data you are going to be dealing with at the time of design and at each stage in the lifecycle of the data?
- How much do you expect it to scale with time?
- What is the rate / schedule at which the data needs to be sent and processed?
- What is the variation expected to be seen in the data incoming?
- How much access restriction does your data need (Row level, object level, fine grained levels of access control), encryption, privacy and compliance?
And other most common areas of design consideration are Availability, Resilience, Reliability and Portability.
Choosing the right database
Having assessed all these questions and considerations, the logical next step is to choose from / eliminate from the database types out there.
- We have the good old Relational Database for the Online Transaction Processing (OLTP) that typically follow normalization rules and Online Analytical Processing (OLAP) that are typically used for Data Mart and Data Warehouse applications. This type requires Structured Query Language to define, manipulate, query and manage.
- And then we have the NOSQL database types for the semi-structured i.e. less structured than Relational database.
There is no formal model or normalization requirement for this type. Key-Value pair DB, Document DB, Wide Column DB, Graph DB are some types of NOSQL databases. More on each of these technologies to be covered in upcoming episodes, code labs and PoCs of the blog series.
Before I go
Phew, you would think that's it for now. Not quite. Let me leave you on a fun note. All this while we have been talking about the types, requirements, design aspects, database choices, and what not.
Here is a simple exercise to flex your understanding so far:
How would you model a NoSQL solution for an application that needs to query the lineage between individual entities that are represented in pairs? (E.g. If A-B, B-C, C-E, A-E, D-F are the row of records that are in pairs, your application should represent A,B,C,E belonging to one lineage and D,F belonging to another).
Tips for Modeling a NoSQL database:
What are the design questions that come to your mind?
Does NOSQL have a schema?
Sometimes it's misleading when we hear that NoSQL options are schema-less. They do not have a schema in the same strict way as the relational databases. However they have an underlying structure that is used to store the data. Each of the four main types of NoSQL databases is based on a specific way of storing data. Here is the logic for a data model in each case:
- Document databases store data in document data type, which is similar to a JSON document. Each document stores pairs of fields and values, with a wide variety of data types and data structures being used as values
- Key-Value database items consist of a key and a value, making this the simplest type of database. The data model consists of two parts: a string with some relationship to the data and the data. Data is retrieved using the direct request method (provide the key and get the data) rather than through the use of a query language
- Wide-column databases use a table form but in a flexible and scalable way. Each row consists of a key and one or more related columns, which are called column families. Each row’s key-column family can have different numbers of columns and the columns can have different kinds of data. Data is retrieved using a query language. This column structure enables fast aggregation queries
- Graph Databases consist of nodes connected by edges. Data items are stored in the nodes and the edges store information about how the nodes are related. Node and relationship information is typically retrieved using specialized query languages, sometimes SQL as well
Take a few minutes to think through the exercise mentioned in the above section
Also you can go through more details of this reference architecture and database options here: https://cloud.google.com/architecture/data-lifecycle-cloud-platform