Dan Croak: What is NoSQL?
Some of the most misused and misunderstood terms in technology today are Web 2.0, the cloud, Agile, NoSQL, and HTML5. Today, I'd like to describe where NoSQL came from and how it applies to internet startups in Boston.
NoSQL Summer
The first thing to know is that there is a NoSQL reading group that meets every other Thursday at the Microsoft NERD Center. The next event is July 29th. We read the academic research papers on different techniques, eat dinner together, and discuss the papers, whiteboarding with a gorgeous view of Boston.
SQL
Developed in the 70s and relying heavily on mathematical theory to work efficiently, SQL is a language for talking to relational databases. SQL databases have worked so well that they've become the de facto standard for storing data, particularly on the web. You've no doubt interacted with thousands of web applications that used a SQL database.
When you imagine a SQL database, think of tables of rows and columns like an Excel spreadsheet.
In the startup world, most web developers use the open source relational databases MySQL or Postgres.
Compared to NoSQL databases, SQL databases are like Fort Knox. They have 30 years of features built into them and once data is there, it's almost certainly staying there. All the little edge cases of two people trying to update the same record are handled with data integrity as the foremost concern.
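To make the "rows and columns" and "data integrity" points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, its columns, and the transfer helper are all made up for illustration; the point is only that a failed update leaves nothing half-done.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "owner TEXT NOT NULL, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts (owner, balance) VALUES ('bob', 50)")
conn.commit()


def transfer(conn, sender, receiver, amount):
    """Move money between two rows; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (amount, receiver),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint refused a negative balance, so the
        # credit to the receiver was rolled back as well.
        return False


ok = transfer(conn, "alice", "bob", 30)
rejected = transfer(conn, "alice", "bob", 500)  # more than alice has
balances = dict(conn.execute("SELECT owner, balance FROM accounts ORDER BY owner"))
```

The second transfer is refused by the database itself, not by application code, which is the kind of built-in safety net the Fort Knox comparison is getting at.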
NoSQL
NoSQL refers to a class of databases that 1) are intended to perform at internet (Facebook, Twitter, LinkedIn) scale and 2) reject the relational model in favor of other (key-value, document, graph) models.
They often achieve performance by having far fewer features than SQL databases. Rather than being a "jack of all trades, master of none", NoSQL databases tend to focus on a subset of use cases.
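As a rough illustration of those three models, here is the same fact ("alice follows bob") written in each shape using plain Python data structures. The keys, field names, and the followers_of helper are invented for this sketch and do not come from any particular database.

```python
# The same fact, "alice follows bob", in three non-relational shapes.
# All names and keys here are made up for illustration.

# Key-value: opaque values looked up by a single key.
key_value = {
    "user:alice:name": "Alice",
    "user:alice:following": "bob",
}

# Document: one nested, self-contained record per entity.
document = {
    "_id": "alice",
    "name": "Alice",
    "following": ["bob"],
}

# Graph: nodes plus explicit, labeled edges between them.
graph = {
    "nodes": {"alice", "bob"},
    "edges": {("alice", "FOLLOWS", "bob")},
}


def followers_of(graph, user):
    """Scan the edge set to find who follows `user`."""
    return sorted(
        src for src, rel, dst in graph["edges"]
        if rel == "FOLLOWS" and dst == user
    )
```

Each shape makes one kind of question cheap: a single key lookup, fetching a whole record at once, or walking relationships.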
Performance at Internet Scale
Let's say you create a web startup in Boston with your buddy, who's a hacker. He's got a $999 Mac mini in his closet and you launch the site on it. The machine is cheap, it's connected to the internet, it gets the job done. It's a Ruby on Rails application with a Postgres database and it all fits fine on the Mac mini at first.

Beta testers love it and soon the Mac mini is reaching its limits. What next?
The first rule of scaling is: scale vertically. Meaning, "buy a bigger machine." More memory, more disk space.
Then, scale horizontally. This results in an architecture that looks something like this:

Many web apps end up looking like this, with more than one server running application logic as requests are distributed across them, and the relational database either on one of the application servers or on its own database server.
Looking at that diagram, where do you think the bottleneck might be? Yeah, the database.
So the next thing people often do is set up multiple databases where one is the master that all writes go to and the others are read-only and load is distributed across them.
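A toy sketch of that master/read-replica setup might look like the following. FakeDatabase and ReplicatedStore are hypothetical stand-ins, not a real driver, and replication here is an immediate in-process copy, whereas real replication usually lags behind the master.

```python
import itertools


class FakeDatabase:
    """Stand-in for a real database connection (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class ReplicatedStore:
    """Send every write to the master and spread reads across replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self._next_replica = itertools.cycle(replicas)  # round-robin reads

    def write(self, key, value):
        # All writes funnel through the single master.
        self.master.rows[key] = value
        # Simplification: copy to replicas right away. Real replication
        # is typically asynchronous, so replicas can be briefly stale.
        for replica in self.replicas:
            replica.rows[key] = value
        return self.master.name

    def read(self, key):
        replica = next(self._next_replica)
        return replica.name, replica.rows.get(key)


store = ReplicatedStore(
    FakeDatabase("master"),
    [FakeDatabase("replica-1"), FakeDatabase("replica-2")],
)
written_to = store.write("greeting", "hello")
reads = [store.read("greeting") for _ in range(3)]
```

Adding replicas gives reads more places to go, but every write still lands on the one master.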
Unfortunately, this is the end of the road in terms of "easy" horizontal scaling. Scaling reads in this way isn't too painful but scaling writes turns out to be very painful. In modern web apps, writes can come fast and furiously:
* Twitter: 2,000 TPS (Tweets Per Second)
* Facebook: 25 billion pieces of content shared each month
* LinkedIn: a new member signup every second
Unsurprisingly, these companies have created NoSQL tools like Cassandra (created by Facebook, used by Facebook, Twitter, and Digg) and generously open sourced them to the software community.
Most of us, on most apps, will not have Big Data problems like these folks, but it becomes less far-fetched every year. Even just building a "platform" app on top of Facebook or Twitter may mean that a NoSQL database is a better tool for the job, if not for Big Data reasons then for data structure reasons.
Why would structure matter?
Side note: other strategies for handling scaling problems include caching, using popular tools like memcached, which was created by LiveJournal. Some people consider memcached to be the father of NoSQL databases given its public arrival and its extreme focus on horizontal scalability. Others use a content distribution network like Boston's own Akamai. It's been speculated that the top 100 websites in the world all use Akamai's services.
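The caching idea from the side note can be sketched like this. A plain dict stands in for memcached, and slow_database_lookup and get_user are hypothetical names; the pattern (check the cache, fall back to the database, remember the answer) is the part that matters.

```python
cache = {}      # stand-in for memcached; a real cache also expires entries
db_calls = []   # records every trip to the "database"


def slow_database_lookup(user_id):
    """Pretend this is an expensive SQL query."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def get_user(user_id):
    """Return the user, hitting the database only on a cache miss."""
    key = f"user:{user_id}"
    if key in cache:
        return cache[key]
    value = slow_database_lookup(user_id)
    cache[key] = value
    return value


first = get_user(42)   # miss: goes to the database
second = get_user(42)  # hit: served from the cache
```

Only the first call touches the database, which is how a cache takes read load off it.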
Structuring data without relations
Let's consider the Twitter example again. A tweet is a very small piece of text created by one user. Twitter should be able to save it quickly and then distribute it widely to that user's followers so they can all read it.
Now, if I go to that user's profile and don't immediately see their latest tweet, it's not the end of the world. If it shows up seconds (or even minutes?) later, no big deal. It's a tweet.
Twitter used to use a relational database (MySQL) and caching (memcached) to handle this, but they're switching to Cassandra. A "key-value" store like Cassandra is designed for this kind of data and can be massively scaled horizontally. Other key-value databases include Redis and Riak (the latter created by Boston's own Basho).
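One way to see why key-value data spreads across machines so easily: if every read and write names exactly one key, the key alone can decide which machine owns the data. The sketch below assumes a simple hash-then-modulo scheme and a made-up PartitionedStore class; real systems such as Cassandra use more elaborate schemes (consistent hashing), so treat this only as the general idea.

```python
import hashlib


class PartitionedStore:
    """Toy key-value store that spreads keys across several 'nodes'."""

    def __init__(self, node_count):
        # Each dict stands in for a separate machine.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, key):
        # Hash the key, then pick a node by modulo. This is an assumed,
        # simplified scheme, not how any specific database does it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)


tweets = PartitionedStore(node_count=4)
for i in range(1000):
    tweets.put(f"tweet:{i}", f"text of tweet {i}")
```

Because no operation needs data from two nodes at once, writes are shared out across all of the nodes instead of queuing up on one master.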
Consider other "real-world" examples that you might need to model in your web application:
- resume
- business card
- receipt
All of these are self-contained data structures. Everything you need in real life is in one place. Yet, if you were to model this in a relational database, you'd probably have a "persons" table and an "orders" table and "line_items" table, splitting the data into its atomic pieces.
But if you're the user, all you want is the receipt for your purchase. Document databases, a class of NoSQL databases that includes MongoDB and CouchDB, are a good fit here. You might store each customer's receipt as a Receipt document, then just send that data cleanly back to the user.
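Here is a small sketch of the difference, with the receipt modeled both ways in plain Python. The sample data, field names, and the build_receipt and fetch_receipt helpers are all invented for illustration; no real database or driver is involved.

```python
import json

# Relational shape: the receipt is split across three "tables".
persons = [{"id": 1, "name": "Pat Example"}]
orders = [{"id": 10, "person_id": 1, "date": "2010-01-01"}]
line_items = [
    {"order_id": 10, "description": "Coffee", "cents": 250},
    {"order_id": 10, "description": "Bagel", "cents": 175},
]


def build_receipt(order_id):
    """Do by hand what a SQL JOIN would do: stitch the pieces together."""
    order = next(o for o in orders if o["id"] == order_id)
    person = next(p for p in persons if p["id"] == order["person_id"])
    items = [
        {"description": li["description"], "cents": li["cents"]}
        for li in line_items
        if li["order_id"] == order_id
    ]
    return {
        "customer": person["name"],
        "date": order["date"],
        "items": items,
        "total_cents": sum(item["cents"] for item in items),
    }


# Document shape: the whole receipt is stored once, already assembled.
receipts = {
    10: {
        "customer": "Pat Example",
        "date": "2010-01-01",
        "items": [
            {"description": "Coffee", "cents": 250},
            {"description": "Bagel", "cents": 175},
        ],
        "total_cents": 425,
    }
}


def fetch_receipt(order_id):
    """One lookup returns everything the user asked for."""
    return receipts[order_id]


payload = json.dumps(fetch_receipt(10))  # ready to send back to the user
```

Both paths produce the same receipt; the document version just skips the reassembly step, at the cost of storing the customer's name inside every receipt.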
So, is your data relational? Do parts of your application better fit a different data structure that NoSQL tools are really good at?
If you're not sure of the answers to those questions, please join us at the bi-weekly NoSQL reading group!
Dan Croak is a web developer at thoughtbot. He makes apps for Boston web startups. Email him at dcroak[at]thoughtbot[dot]com.
Photo Credit: tlossen on Flickr











