If there is one topic that has generated more debate than anything else, it is the scalability of lists in SharePoint. You can find as many opinions as there are people. Because I’m one of those people, allow me to express my opinions on this as well.
My opinion is that Microsoft designed the SharePoint content database for low management overhead, not for ultra-extreme scalability. But where you need that extreme scalability, the architecture provides enough hooks and management infrastructure to satisfy almost any conceivable need for a real-world project.
Now let me argue my case. First of all, what is this concept of scalability? Scalability is vastly different from performance.
Scalability versus Performance
Let’s imagine a non-computer task. Let’s say your business is to carry goods from point A to point B on the back of a donkey. Business is good, so you seem to be getting more and more requests to carry such goods. Soon enough, you realize that you need a donkey that performs better. So you replace your donkey with a bodybuilder donkey. As business grows, this bodybuilder donkey is not enough, either. So you replace him, too, with a really expensive Sylvester Stallone donkey. Business is getting even crazier, so you replace the Sylvester Stallone donkey. You get an Arnold Schwarzenegger donkey and you pump him with steroids. By now, your donkey is really performing well, but he’s also really expensive. And even this crazy donkey has limits. He is expensive, high maintenance, doesn’t show up at work on time, and some people say he has a funny accent.
You can achieve the same business goal by using numerous cheaper donkeys and scaling your operation among them. This means that a scalable architecture breaks its tasks down into equivalent activities that any cheaper donkey can pick up.
The process of finding a stronger and stronger donkey is aiming for better performance. The equivalent in the computer world is replacing weaker servers with stronger ones, commonly called scaling up. The process of distributing distributable loads across multiple servers is referred to as scaling out your operation.
As you can see, scalability is a very different animal (no pun intended) from performance. The end result is perhaps the same: you can support more business and more users hitting your system. It is how you solve the problem, by providing better performance or better scalability, that differentiates the cost. Numerous less-powerful servers are almost always cheaper than a single very high-performance supercomputer (or superdonkey).
Now let’s leave the donkeys behind and come back to SharePoint. In SharePoint, you can distribute your application across different site collections and different sites. Within all these sites and site collections, your data will eventually live inside lists or document libraries. As the architect of a SharePoint project, the first thing you should do is attempt to distribute the logical layout of your applications among sites and site collections, just as you would solve the scalability problem in any other project.
However, any system can be scaled out only as far as its load can be broken into equivalent, distributable units. In other words, the overall throughput of a system depends on both scalability and performance. So performance is important, too! And you cannot talk about list scalability in SharePoint while turning a blind eye to performance. So let me talk a little bit about performance, too.
Because you’re a developer, I’d like you to sit back and think of the most complicated project you’ve ever worked on. Chances are that project involved some sort of database. In that complicated database, how many tables had more than 10,000 or 20,000 rows? I would say that, ignoring reporting databases, in most scenarios 80% of your tables don’t have that many rows. This number of 10,000 or 20,000 rows is something that SharePoint lists can very comfortably handle. But when developing for a system and making an architecture scalable, you must also consider all sorts of operations that may be performed on that data: inserts, updates, deletes, and queries. Also, SharePoint lists give you a ton of infrastructure that lets you create sophisticated user interfaces and permissioning models on top of your data. All that doesn’t come for free!
But despite all the infrastructure and additional features you get on top of your data with SharePoint, I can very confidently say that when targeting that 80% scenario, SharePoint 2010 lists are more than capable of handling all sorts of operations at those numbers. In fact, built-in mechanisms such as column indexing let SharePoint go well beyond those numbers for query operations quite comfortably.
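To make this concrete, here is a minimal sketch of column indexing and querying through the server object model. The site URL, the "Orders" list, and the "CustomerID" column are hypothetical names I made up for illustration:

```csharp
using Microsoft.SharePoint;

class IndexedQueryDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://server/sites/demo"))
        using (SPWeb web = site.OpenWeb())
        {
            SPList list = web.Lists["Orders"];

            // Index the column so large-list queries against it stay fast.
            SPField field = list.Fields["CustomerID"];
            field.Indexed = true;
            field.Update();

            // Query against the indexed column; RowLimit keeps the result set bounded.
            SPQuery query = new SPQuery
            {
                Query = "<Where><Eq><FieldRef Name='CustomerID' />" +
                        "<Value Type='Text'>CUST-042</Value></Eq></Where>",
                RowLimit = 100
            };
            SPListItemCollection items = list.GetItems(query);
        }
    }
}
```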
The 20% Scenario
Now let’s consider the 20% scenario. It consists of cases in which you have a very large number of items in a list (let’s say millions), extremely large documents inside a document library, or a large number of smaller documents in a document library.
Let’s discuss the 20% scenario of document libraries first. Document libraries accentuate scalability and performance concerns because SharePoint stores its documents inside the content database by default. Is storing documents as blobs inside a database a good idea?
Perhaps the best and the worst thing about databases is that there is no black or white answer. A lot of people argue that storing blobs in a database is not recommended. Nothing could be further from the truth. There are very significant advantages to putting blobs in a database; for instance, you don’t have any file name overwrite issues, you don’t have virus issues, backups are easy, transactions are possible, and so on. One argument I have heard against using blobs in a database is that performance may suffer. Again, such a black-or-white argument can be very misleading, especially when talking about databases. In fact, performance depends on how often the blobs get updated and the size of those blobs. Databases tend to get fragmented more easily than file systems. And as they get more and more fragmented, their performance gets worse.
File systems, on the other hand, tend to defragment themselves during downtimes. Also, new advances in solid-state drive technology make defragging largely a problem of the past. While file systems have the advantage of being less susceptible to fragmentation, they suffer from some significant disadvantages as well. Given the file system’s fixed cluster size and the metadata lookup required to locate a file before it can be read, the per-file overhead for extremely small files can be significant. With smaller file sizes, you can actually get better performance by putting those files in a database rather than on a file system.
As a rough rule of thumb, files up to 256 KB will almost always perform better when placed in the database. Files between 256 KB and 10 MB are a gray area, and the right choice depends upon how often those files get updated. Files larger than 10 MB are better stored on the file system.
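If it helps to see that rule of thumb as code, here is a trivial, purely illustrative helper. The type and method names are mine, not part of any SharePoint API, and the thresholds are heuristics rather than hard limits:

```csharp
// Encodes the rule of thumb above; treat the gray area as a judgment
// call driven by how often the file is updated.
public enum BlobStore { ContentDatabase, GrayArea, FileSystem }

public static class BlobPlacement
{
    public static BlobStore Suggest(long fileSizeBytes)
    {
        const long SmallLimit = 256L * 1024;        // 256 KB
        const long LargeLimit = 10L * 1024 * 1024;  // 10 MB

        if (fileSizeBytes <= SmallLimit) return BlobStore.ContentDatabase;
        if (fileSizeBytes <= LargeLimit) return BlobStore.GrayArea;
        return BlobStore.FileSystem;
    }
}
```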
As a result, when storing blobs inside the SharePoint content database, you as the architect have to make a conscious decision based upon the file size and the update frequency of those blobs. By default, the blobs go in the content database. However, with SharePoint 2010 you have the ability to leverage Remote Blob Storage (RBS). Using an RBS provider, you can choose to store the actual file in an alternate store without affecting the SharePoint object model at all. In other words, the storage location of the document, be it the content database or an RBS store, is transparent to the SharePoint object model and thus to your applications.
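For example, reading a document looks exactly the same either way. In this sketch (the URL and file path are hypothetical), nothing in the code reveals whether RBS is configured:

```csharp
using Microsoft.SharePoint;

class RbsTransparencyDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://server/sites/demo"))
        using (SPWeb web = site.OpenWeb())
        {
            // GetFile and OpenBinary behave identically whether the bytes
            // live in the content database or in an RBS store behind it.
            SPFile file = web.GetFile("Shared Documents/Proposal.docx");
            byte[] contents = file.OpenBinary();
        }
    }
}
```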
If you want to read more about the inner details of this topic, read the "To Blob or Not to Blob" white paper from Microsoft Research at http://research.microsoft.com/apps/pubs/default.aspx?id=64525.
Next, let me talk about the 20% scenario of lists that contain millions of items. If you reverse-engineer the SharePoint content database structure (which you should do only for learning purposes), you will see some significant changes between SharePoint 2007 and SharePoint 2010. Tables now carry integer-based indexes in addition to GUIDs, and queries now provide NOLOCK hints and steer the execution plans toward those integer-based indexes. Connection strings use ENLIST=FALSE, thus preventing distributed transactions and keeping isolation levels from escalating to serializable. There are many such under-the-cover improvements that allow a SharePoint content database to be much more scalable than before. At the same time, the SharePoint content database has been designed so it can be managed with low overhead. Your custom applications can usually count on a database administrator updating statistics at 2:00 AM, a luxury a SharePoint content database does not have. So with auto-updating statistics and clustered indexes, you still have the ability to easily query millions of items in a single list and return the results in a very short time. Such really fast queries can be done using the object model. The user interface, however, has been designed more for the 80% scenario. In other words, if you generate HTML that shows millions and millions of rows as one query result, obviously that page will load slowly. And to be honest, I’d rather have a system that works well for 80% than have the 20% tail wag the 80% dog.
Still, the 20% scenario is important. To address even this problem, SharePoint comes with numerous management features that prevent average users from querying more than a certain number of items or creating projections of lists with too many columns. These features are commonly referred to as list throttling.
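A couple of those knobs are visible right in the object model. The sketch below, assuming the same hypothetical site and list as before, reads the web application's throttling threshold and shows how privileged code can ask to bypass the throttle for one query:

```csharp
using System;
using Microsoft.SharePoint;

class ThrottlingDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://server/sites/demo"))
        using (SPWeb web = site.OpenWeb())
        {
            SPList list = web.Lists["Orders"];

            // The per-query threshold for ordinary users (5,000 items out of the box).
            var threshold = site.WebApplication.MaxItemsPerThrottledOperation;
            Console.WriteLine("Throttle threshold: " + threshold);

            // Sufficiently privileged code can ask to bypass the throttle for a
            // specific query; farm and list policy still have the last word.
            SPQuery query = new SPQuery
            {
                Query = "<Where><IsNotNull><FieldRef Name='ID' /></IsNotNull></Where>",
                QueryThrottleMode = SPQueryThrottleOption.Override,
                RowLimit = 2000
            };
            SPListItemCollection items = list.GetItems(query);
        }
    }
}
```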
Thus, with extremely large lists, querying is never a problem. What can become a problem, however, is inserting or updating items.
So suppose you have the borderline scenario of a table with millions and millions of items that are changing very frequently. First, I would argue that this scenario is quite rare. Second, this being a borderline scenario, it is reasonable to expect an organization to hire a database administrator to manage such a demanding table. And finally, even such a table can be surfaced inside SharePoint using Business Connectivity Services (BCS). So it is really not such a problem after all.
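Once that external table is surfaced as an external list through BCS, your code reads it with the same familiar API. Here is a minimal sketch, assuming a hypothetical external list named "External Orders" whose external content type exposes a "CustomerID" field:

```csharp
using System;
using Microsoft.SharePoint;

class ExternalListDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://server/sites/demo"))
        using (SPWeb web = site.OpenWeb())
        {
            // The rows stay in the external database; SharePoint materializes
            // items on demand through the BCS connector.
            SPList externalList = web.Lists["External Orders"];
            SPQuery query = new SPQuery { RowLimit = 10 };
            foreach (SPListItem item in externalList.GetItems(query))
            {
                // "CustomerID" is a field the hypothetical external content type exposes.
                Console.WriteLine(item["CustomerID"]);
            }
        }
    }
}
```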
With all this background behind us, let me restate my opinion.
In my opinion, the SharePoint content database has been designed for low management overhead, not for ultra-extreme scalability. But where you need that extreme scalability, the architecture provides enough hooks and management infrastructure to satisfy almost any conceivable need for a real-world project.
Along the way, I alluded to various facilities inside SharePoint 2010 that let you satisfy the 20% extreme scenarios:
- List column indexing
- List throttling and management scenarios
- SharePoint 2010 Remote Blob Storage (RBS)
You can read more about each of these in Chapter 7 of my book, and about BCS in Chapter 9.
Summary
The word “scalability” is vastly misunderstood and greatly misused. In this article, I expressed my opinions about the practical considerations of using SharePoint for a real-world application. As I argued, with the right architecture, one that considers both performance and scalability, SharePoint is a decent platform for most needs. Not only are there real improvements in the internals of SharePoint, the plumbing Microsoft wrote, there are also enough extensibility points in the product to truly allow you to build enterprise-class solutions using SharePoint 2010.
Until next time, Happy SharePointing.