For years developers have been asking for query over data contained in a DataSet in a way that supports the expressiveness needed by today’s data-centric .NET applications.
As part of the .NET framework 3.5, Microsoft® will introduce support for a technology called Language Integrated Query (LINQ), and with this introduction, an implementation of LINQ to DataSet.
The ADO.NET programming model gave us the ability to explicitly cache data in an in-memory data structure called the DataSet. The DataSet uses the relational paradigm of tables and columns to store and represent the cached data. In addition, the DataSet also exposes a number of services (i.e., metadata, constraints, change tracking, etc.) that makes working with data from a relational database an easy and straightforward task.
The one glaring weakness of the DataSet is not exposing rich query capabilities, the kind of support developers have grown accustomed to when accessing relational data. Because of this limitation, developers have had to live with DataSet’s limited query mechanism (via simple string expressions for sorting or filtering) or have had to build their own custom rich query implementations on top of DataSet. This has lead to a lot of extra work and custom application code that must be maintained by developers.
As part of .NET Framework 3.5, Microsoft introduces support for a technology called Language Integrated Query (LINQ). LINQ exposes a common query abstraction integrated into the .NET programming languages (C# and Visual Basic) for querying all kinds of data including objects, XML, and relational data. Hence, supporting LINQ to DataSets seemed like a natural fit.
All code snippets in this article will run against a sample instance of DataSet created with the code in Listing 1. This code creates a DataSet with three tables: National Parks, Countries, and Continents. The DataSet also contains a DataRelation from Continents to Countries. After creating the DataTables and DataSet, data is added to the sample DataSet instance.
In Listing 2 you’ll see a small utility method for dumping the query results to the console if required. All of the samples in this article will run against the sample DataSet mentioned earlier, making it possible to use the DumpResults method to execute queries and examine the results.
LINQ to DataSet-What Every Developer Needs to Know
LINQ queries target IEnumerable<T> sources. Unfortunately, the DataSet’s DataTable and the DataTable’s DataRowCollection do not implement IEnumerable<DataRow> so these types cannot be used as sources for LINQ query expressions. To work around this issue, an extension method called AsEnumerable was added to the DataTable type. This extension method takes a source DataTable and wraps it in an object of the type EnumerableRowCollection which implements IEnumerable<DataRow>. This allows DataTables to be a source for LINQ query expressions.
With the exception of some DataSet-specific LINQ query operators and features that I will discuss later in this article, DataSet’s LINQ implementation does not differ from LINQ to Objects. In fact, the DataTable.AsEnumerable method turns a given DataTable into a sequence of objects where the element type is DataRow. To illustrate this point, the following code shows two, equivalent, LINQ query expressions both using the standard LINQ to Objects implementation of the Where query operator to apply the predicate to the IEnumerable<DataRow> source.
var query1 = from park in parksDataTable.AsEnumerable()
where (string) park["Country"] == "Japan"
select park;
List<DataRow> list =
new List<DataRow>(
parksDataTable.Rows.Cast<DataRow>());
var queryLinqtoObjects = from park in list
where (string)park["Country"] == "Japan"
select park;
When writing LINQ queries against DataSet, one needs to be aware that the element type of the source is specifically DataRow. The following code snippet demonstrates why this is true.
var query2 = from park in parksDataTable.AsEnumerable()
where !park.IsNull("Rating") &&
(int) park["Rating"] > 8
select park;
There are actually two interesting issues here. First, when accessing the column values of a given DataRow, one must call the DataRow indexer. Since the column values can be any type, the indexer return type is object instead of the specific type of column. In other words, the DataRow column value accessor is weakly typed. This allows you to use a single method to retrieve column values independent of type. On the downside, it makes writing LINQ to DataSet queries awkward and error prone, given that the user must remember to cast to the desired type. Second, the DataSet represents null values as the DBNull constant which cannot be cast to any CLR value type; this makes writing LINQ queries tricky since the developer must remember to guard all indexer calls with a call to the DataRow.IsNull method to verify that a column value is not equal to DBNull, before accessing the value and unboxing it to the desired type. If this guard is not put in place, a runtime exception will occur when the value is DBNull and the code tries to unbox the value.
To make writing LINQ to DataSet queries easier and less error prone, a generic extension method called Field<T> was added to the DataRow class. Essentially the Field<T> method works the same as the DataRow indexer but allows the user to specify the static type for the return value of the method as the generic parameter. More importantly, if the user specifies a nullable type as the generic parameter, the field method will convert any DBNull values to null. Using the field method, here is the equivalent LINQ query from the prior example.
var query3 = from park in parksDataTable.AsEnumerable()
where park.Field<int?>("Rating") > 8
select park;
As a general rule, all access to DataRow column values should be done through the Field<T> method when developing LINQ to DataSet query expressions. This makes the query expression much less error prone and generally provides much cleaner code.
As a general rule, all access to DataRow column values should be done through the Field<T> method when developing LINQ to DataSet query expressions. It makes the query expression much less error prone and generally provides much cleaner code.
Since DataSet is an entirely in-memory cache, all LINQ queries against it are translated directly into .NET IL (intermediate language). This is different from LINQ technologies such as LINQ to SQL or LINQ to Entities which require the LINQ query to be translated to the target source’s query language, and are therefore limited by what can be translated. As a result, LINQ to DataSet queries can be arbitrarily complex and can contain anything that can be expressed in the host language. For example, here is a LINQ to DataSet query that could not be easily translated to SQL.
Func<long, bool> checkPopulation = delegate(long i)
{ return i < 100000000;};
var query4 = from country in
countriesDataTable.AsEnumerable()
where country.Field<string>("Name").Length > 3 &&
!country.HasErrors &&
checkPopulation(
country.Field<long>("Population"))
select country;
When working with LINQ to DataSet, you should also keep in mind how the LINQ set query operators (Union, Distinct, Intersection, Except) work. These operators all have two overloads, one which takes an IEqualityComparer<T> and one that does not. When calling one of the set query operators where the source(s) is IEnumerable<DataRow>, the set operator implementation will call DataRow.Equals() to compare elements when an IEqualityComparer<DataRow> comparer is not passed in. For developers familiar with the well known relational semantics of set operators, the results will be quite unexpected since the DataRow.Equals method will compare DataRow references and not the underlying values.
To support more traditional relational semantics LINQ to DataSet includes a type called DataRowComparer which includes a default DataRow comparer for comparing two DataRows.
var source1 = from park in parksDataTable.AsEnumerable()
where park.Field<int?>("Rating") < 8
select park;
var source2 = from park in parksDataTable.AsEnumerable()
where park.Field<int>("YearEstablished") > 1960
select park;
var unionResults = source1.Union(source2,
DataRowComparer.Default);
This comparer compares the specific row values and column types to decide if two rows are equal. It does not require that other metadata (i.e., column names) match since in most cases these may differ.
Shaping the Query Results-Beyond DataRows
LINQ query developers often want to shape their query results to meet specific application needs. The simplest example of this is using the LINQ Select query operator to project the results of a query expression.
var query5 = from park in parksDataTable.AsEnumerable()
orderby park.Field<int>("YearEstablished")
select new
{
ParkName = park.Field<string>("Name"),
Established =
park.Field<int>("YearEstablished")
};
When projecting the results of a LINQ to DataSet query, the fact that the values must be projected into a type other than DataRow can be limiting. This must be done because DataRows cannot be created that do not tie to an existing DataTable. This means that the DataRow column values are copied to the new instance based on the specified selector and hence any changes to the new instance are not reflected in the original DataRow.
Another detail to note-when defining the anonymous type in the projection, a name must be specified for each member because the value is being selected via a method and not via a Property or Field member. This is not really a huge issue, but a minor usability issue when projecting from DataRows.
As a workaround you can create DataRows via the DataTable’s DataRow factory method (DataRowCollection.Add) in the projection selector expression, or you can project the entire DataRow into a member of a new type instance.
var query6 = from park in parksDataTable.AsEnumerable()
orderby park.Field<int>("YearEstablished")
select new
{
ParkName = park.Field<string>("Name"),
Established =
park.Field<int>("YearEstablished"),
DataRow = park
};
You might also find it interesting to group results based on a specific key.
var query7 = from park in parksDataTable.AsEnumerable()
group park by park.Field<string>("Country") into g
select new {Country = g.Key, Count = g.Count()};
Note that the key selector for the Group By query operator uses the Field<T> method instead of the DataRow indexer. Although using the DataRow indexer could work, there is an even more sinister problem if one forgets to not unbox the return value of the indexer. Namely, the compiler won’t complain if one doesn’t include the proper cast expression. In some cases, this will mean that object references, instead of the underlying values, will be used to do comparisons between key values, most likely not the desired result. As stated before, always use the DataRow.Field<T> method to avoid these issues when developing LINQ to DataSet queries.
As most developers experienced in SQL programming know, the Join operator is an invaluable tool for querying relational databases where the data has been normalized into multiple tables. Interestingly, DataSet already has the ability to do joins via DataRelation objects created by the developer to relate multiple DataTables. This makes it entirely possible that when using LINQ to DataSet one will not require the use of the Join query operator as much as they would with the other LINQ technologies. The following sample demonstrates a Join between two DataTables without a DataRelation and the use of an existing DataRelation.
var query8 = from park in parksDataTable.AsEnumerable()
join country in countriesDataTable.AsEnumerable()
on park.Field<string>("Country") equals
country.Field<string>("Name")
select new
{
ParkName = park.Field<string>("Name"),
Country = country.Field<string>("Name"),
Continent = country.
GetParentRow(
ContinentCountryDataRelation)
.Field<string>("Name")
};
Developers can choose to use either the LINQ Join query operator or DataRelations. When using a DataRelation, the membership of the join is pre-calculated and kept up to date based on changes to the related DataTables. When using the LINQ Join query operator the developer does not have to maintain DataSet DataRelations; although one may require DataRelations for other functionality like constraints.
DataSet Specific LINQ Operators
In addition to the standard LINQ query operators, LINQ to DataSet includes two special LINQ query operators called CopyToDataTable and AsLinqDataView designed specifically to work with DataTables and IEnumerable<DataRow> sources. When using LINQ to query a DataTable, the results of the query are often returned as an IEnumerable<DataRow> instance. This raises two potentially interesting issues for developers. First, even though the individual rows in the query result are live (the individual elements in the sequence are the same DataRows as those from the original source DataTable(s)) the results of the query are not live. For example, after modifying one of the DataRows within the set of query results so that it no longer qualifies as part of the query results set, it will remain in the set of results until the query is run again. Second, much of the .NET infrastructure (i.e., binding, serialization, etc.) has been built to work with DataTables and DataViews instead of IEnumerable<DataRow> sources. Hence, it becomes desirable in some cases to take the results on a LINQ to DataSet query and turn it into a DataView or load it into a DataTable.
The AsLinqDataView query operator takes a LINQ to DataSet query and turns it into a DataView instance. Although similar to any standard DataSet DataView, this instance requires you to specify the predicate and order by selectors in the LINQ to DataSet query as lambda expressions instead of the string-based query language currently supported when creating DataViews.
DataView dataView = (from park in
parksDataTable.AsEnumerable()
where park.Field<int?>("YearEstablished") < 1960
orderby park.Field<string>("Country")
select park).AsDataView();
DataRowView[] drv = dataView.FindRows("Canada");
Note that when calling the AsLinqDataView method, the implementation will create a DataView index based on the criteria in the query. This index, maintained on changes to the underlying source DataRows or changes to the DataRowView instances, is based on the key selector(s) specified in the OrderBy clause so that any lookups via the DataView happen via the underlying index and will generally perform much better than calling the Where query operator.
If you do not require live data and do not want the overhead associated with maintaining it, you are able to copy the results of a LINQ to DataSet query expression into a new or existing DataTable via the CopyToDataTable query operator.
DataTable newParksTable = (from park in
parksDataTable.AsEnumerable()
where park.Field<int?>("YearEstablished") < 1960
orderby park.Field<int?>("YearEstablished")
select park).CopyToDataTable();
When using this operator, the DataRow column values are copied into new DataRows, meaning that the data is no longer live with respect to the original source. It is up to the developer to merge back any changes to the original source DataTable.
Conclusion
You’ll encounter many pleasures when using the LINQ technology, such as discovering where you can use it to solve common coding problems, often in significantly fewer lines of code and in a much more maintainable manner. Since LINQ to DataSet supports all standard LINQ operators and expressions, it supports a significant amount of functionality, well beyond most of the capabilities of typical query mechanisms. Hence, one should be open to experimenting with LINQ to DataSet to solve programming problems beyond those solved by simple filtering, ordering, and projecting operations demonstrated by the samples in this article.
Since LINQ to DataSet supports all standard LINQ operators and expressions, there is actually a significant amount of functionality supported which is well beyond most of the capabilities of typical query mechanisms.