I've been speaking with the guys from Adoreboard, a start-up based in Belfast, about how Big Data works from a practical point of view. Differing data structures often mean that a typical monolithic approach to data storage doesn't always make sense.
Bob Marks from Adoreboard has written a guest post on all things Big Data and data storage. He describes how using multiple data storage techniques not only helps the Adoreboard team achieve their technical goals but also doesn't necessarily have the engineering impact you would imagine.
Delivery and Databases
By Bob Marks, Chief Architect, Adoreboard.com
Gone are the days when buying an item from a mail order catalogue meant a wait of up to thirty days for delivery. Nowadays, with the massive prevalence of e-commerce sites, consumers can expect their purchases within two or three days.
Most stores provide next-day delivery from around the world, either at additional cost or as part of a subscription model (e.g. Amazon Prime). The logistics of delivering goods globally have driven innovation such as the creation of ‘The Maersk (Triple E class)’, recently constructed as the biggest cargo ship in the world. The scale of the ship is staggering: it is 1,300 feet long (a quarter of a mile), 200 feet wide and as tall as a 20-storey building. The ship is so big that few ports can accommodate it.
However, the advantage of such a large ship lies in its economies of scale, which make it the cheapest method of transporting goods around the world. At the other end of the scale, bespoke delivery services exist, such as the cycling couriers popular in large cities like London or New York, who can deliver important parcels within an hour.
Parallels to the delivery world can be seen in software companies with regard to databases, both large and small. I saw this first hand when I worked for IBM. Our database choice was simple: we used IBM DB2, a robust relational database. In other jobs I have used Oracle, another massive player in the database world. These databases were super powerful, but their main disadvantage was the cost incurred through large licence fees. Nowadays, one of the favourite buzzwords being tossed about is "Big Data", with various NoSQL databases fighting for the top position.
As any software architect knows, one of the big decisions they face is choosing a database at the start of a project. In other words, which ship are you going to use to transport your most prized assets?
So let’s take the new NoSQL-style databases. The advantages of these data stores centre on scale and the fact that they do not enforce the creation of a schema. However, software architects and engineers like the comfort of using relational databases. Their proven advantages include data normalisation and the ability to write powerful queries in a straightforward and well-defined query language (SQL). Also, relational databases have been around for years, so the perceived risk can seem lower than with the new "Big Data" databases. Let’s look at some example databases in turn.
The first of these databases is MySQL, an extremely popular relational database which is used for storing smaller tables such as user account information and configuration data, to name but a couple. MySQL is analogous to a delivery van: popular, easy to use, a good fit for most situations and able to handle most loads encountered in everyday usage.
The next of these databases is MongoDB, a new but very popular NoSQL document-based database that stores its documents in a JSON-style format and can handle terabytes of data across multiple servers (or shards). This is a good fit for storing raw content, time-based events and log information. This database compares to a large delivery lorry, great for transporting large amounts of parcels (but overkill for one birthday card!).
The last of the databases is Elasticsearch, which can be used for storing textual data and provides us with extremely powerful search capabilities, such as fuzzy matching, which most databases don't provide. In the delivery world this type of database could be compared to the cycling courier, as it does one job extremely well and is fast and nimble.
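To make "fuzzy matching" a little more concrete, here is a minimal sketch in plain Python. It is not the Elasticsearch API and not how the engine works internally; it only illustrates the underlying idea of accepting terms that lie within a small edit distance of the query. The function names `edit_distance` and `fuzzy_match` and the `max_edits` parameter are made up for this illustration.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character inserts,
    deletes or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, terms: List[str], max_edits: int = 1) -> List[str]:
    """Return the terms within max_edits of the query, ignoring case."""
    q = query.lower()
    return [t for t in terms if edit_distance(q, t.lower()) <= max_edits]


# Illustrative term list; a real search engine would consult its index.
terms = ["database", "databases", "delivery"]
print(fuzzy_match("databse", terms))  # ['database']
```

A misspelt query such as "databse" still finds "database" because only one character needs to be inserted. A dedicated search engine gets the same effect over millions of terms by using index structures rather than comparing every term in turn as this sketch does.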
A question that I get asked is: “Does supporting three databases make the code more difficult to write and maintain?” The short answer is no, provided that we as developers are willing to generalise beyond our own specialisms. The reality is that we as engineers must learn the idiosyncrasies of three different database engines.
Although there are a few caveats, life can be made much easier. Firstly, code can be written in a certain style, for example implementing data access through DAOs (data access objects), which act as an intermediary between the application code and the database. Secondly, using a dependency injection framework such as the Spring Framework makes the code much easier to write, test and maintain, and lets us mix and match our DAO implementations.
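As a rough illustration of the DAO and dependency injection style described above, here is a minimal sketch. It is written in plain Python for brevity, not in Java with Spring, and it is not Adoreboard's actual code: `User`, `UserDao`, `InMemoryUserDao` and `AccountService` are hypothetical names invented for the example, and the injection is done by hand through the constructor instead of by a framework.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class User:
    """Hypothetical domain object, used only for illustration."""
    user_id: int
    email: str


class UserDao(Protocol):
    """The contract the application codes against; it names no database."""

    def save(self, user: User) -> None: ...

    def find_by_id(self, user_id: int) -> Optional[User]: ...


class InMemoryUserDao:
    """Stand-in implementation backed by a dict, handy for tests.

    A MySQL- or MongoDB-backed class would expose the same two methods
    and keep its driver-specific code to itself.
    """

    def __init__(self) -> None:
        self._rows: Dict[int, User] = {}

    def save(self, user: User) -> None:
        self._rows[user.user_id] = user

    def find_by_id(self, user_id: int) -> Optional[User]:
        return self._rows.get(user_id)


class AccountService:
    """Application code: it receives its DAO instead of constructing one."""

    def __init__(self, users: UserDao) -> None:
        self._users = users

    def register(self, user_id: int, email: str) -> User:
        user = User(user_id, email)
        self._users.save(user)
        return user

    def email_for(self, user_id: int) -> Optional[str]:
        user = self._users.find_by_id(user_id)
        return user.email if user else None


# Wired by hand here; a dependency injection container would do this step.
service = AccountService(InMemoryUserDao())
service.register(1, "bob@example.com")
print(service.email_for(1))  # bob@example.com
```

Because `AccountService` only knows about the `UserDao` contract, swapping the in-memory class for one backed by a real database changes the wiring line and nothing else. That is the "mix and match" property the paragraph above refers to.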
At adoreboard.com, we use multiple databases: MySQL, Elasticsearch and MongoDB. All of these databases are open source and free to use. Each excels in its own way, and together they give us ‘the right tool for each job’ across our entire product stack.
So, returning to ‘The Maersk’ analogy: there is a temptation to build the world’s biggest ship and make it the only mode of transport for your precious cargo. My argument is that architects must think anew and consider what I am calling the ensemble approach to databases, that is, blending a range of databases to match the particular business object at hand. That's right: build mini Maersks using the tools that make the most sense to you. It will make you think smarter about your business and, you never know, you might be just fine with the cycling courier!