Chapter 5
The driving force behind the language enhancements to C# 3.0 was LINQ. The new features and the implementation of those features were driven by the need to support deferred queries, translate queries into SQL to support LINQ to SQL, and add a unifying syntax to the various data stores. Chapter 4 shows you how the new language features can be used for many development idioms in addition to data query. This chapter concentrates on using those new features for querying data, regardless of source.
A goal of LINQ is that language elements perform the same work no matter what the data source is. However, even though the syntax works with all kinds of data sources, the query provider that connects your query to the actual data source is free to implement that behavior in a variety of ways. Understanding those behaviors makes it easier to work with different data sources transparently. If you need to, you can even create your own data provider.
Item 36: Understand How Query Expressions Map to Method Calls
LINQ is built on two concepts: a query language, and a translation from that query language into a set of methods. The C# compiler converts query expressions written in that query language into method calls.
Every query expression has a mapping to a method call or calls. You should understand this mapping from two perspectives. From the perspective of a class user, you need to understand that your query expressions are nothing more than method calls. A where clause translates to a call to a method named Where(), with the proper set of parameters. As a class designer, you should evaluate the implementations of those methods provided by the base framework and determine whether you can create better implementations for your types. If not, you should simply defer to the base library versions. However, when you can create a better version, you must make sure that you fully understand the translation from query expressions into method calls. It’s your responsibility to ensure that your method signatures correctly handle every translation case. For some of the query expressions, the correct path is rather obvious. However, it’s a little more difficult to comprehend a couple of the more complicated expressions.
The full query expression pattern contains eleven methods. The following is the definition from The C# Programming Language, Third Edition, by Anders Hejlsberg, Mads Torgersen, Scott Wiltamuth, and Peter Golde (Microsoft Corporation, 2009), §7.15.3 (reprinted with permission from Microsoft Corporation):
delegate R Func<T1,R>(T1 arg1);
delegate R Func<T1,T2,R>(T1 arg1, T2 arg2);

class C
{
    public C<T> Cast<T>();
}

class C<T> : C
{
    public C<T> Where(Func<T,bool> predicate);
    public C<U> Select<U>(Func<T,U> selector);
    public C<V> SelectMany<U,V>(Func<T,C<U>> selector,
        Func<T,U,V> resultSelector);
    public C<V> Join<U,K,V>(C<U> inner,
        Func<T,K> outerKeySelector,
        Func<U,K> innerKeySelector,
        Func<T,U,V> resultSelector);
    public C<V> GroupJoin<U,K,V>(C<U> inner,
        Func<T,K> outerKeySelector,
        Func<U,K> innerKeySelector,
        Func<T,C<U>,V> resultSelector);
    public O<T> OrderBy<K>(Func<T,K> keySelector);
    public O<T> OrderByDescending<K>(Func<T,K> keySelector);
    public C<G<K,T>> GroupBy<K>(Func<T,K> keySelector);
    public C<G<K,E>> GroupBy<K,E>(Func<T,K> keySelector,
        Func<T,E> elementSelector);
}

class O<T> : C<T>
{
    public O<T> ThenBy<K>(Func<T,K> keySelector);
    public O<T> ThenByDescending<K>(Func<T,K> keySelector);
}

class G<K,T> : C<T>
{
    public K Key { get; }
}
The .NET base library provides two general-purpose reference implementations of this pattern. System.Linq.Enumerable provides extension methods on IEnumerable<T> that implement the query expression pattern. System.Linq.Queryable provides a similar set of extension methods on IQueryable<T> that supports a query provider’s ability to translate queries into another format for execution. (For example, the LINQ to SQL implementation converts query expressions into SQL queries that are executed by the SQL database engine.) As a class user, you are probably using one of those two reference implementations for most of your queries.
From the class author's perspective, you can create a data source that implements IEnumerable<T> or IQueryable<T> (or a closed generic type constructed from IEnumerable<T> or IQueryable<T>), and in that case your type already implements the query expression pattern. It has that implementation because you're using the extension methods defined in the base library.
Before we go further, you should understand that the C# language does not enforce any execution semantics on the query expression pattern. You can create a method that matches the signature of one of the query methods and does anything internally. The compiler cannot verify that your Where method satisfies the expectations of the query expression pattern. All it can do is ensure that the syntactic contract is satisfied. This behavior isn’t any different from that of any interface method. For example, you can create an interface method that does anything, whether or not it meets users’ expectations.
Of course, this doesn’t mean that you should ever consider such a plan. If you implement any of the query expression pattern methods, you should ensure that each method’s behavior is consistent with the reference implementations, both syntactically and semantically. Except for performance differences, callers should not be able to tell whether your method or the reference implementation is being used.
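For example, if a hypothetical collection type supplies its own Where for use in query expressions, that method should behave just as the reference version does: a pure, deferred filter that leaves the source untouched. Here is a minimal sketch; the Bag<T> type is invented purely for illustration:
using System;
using System.Collections.Generic;

// A hypothetical collection type, used only for illustration.
public class Bag<T>
{
    private readonly List<T> items = new List<T>();

    public void Add(T item)
    {
        items.Add(item);
    }

    // Matches the shape the compiler expects for a where clause:
    // a pure, deferred filter that never modifies the stored items.
    public IEnumerable<T> Where(Func<T, bool> predicate)
    {
        foreach (T item in items)
            if (predicate(item))
                yield return item;
    }
}
Because instance methods are found before extension methods, a query written against a Bag<T> binds to this Where rather than to System.Linq.Enumerable.Where.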
Translating from query expressions to method invocations is a complicated iterative process. The compiler repeatedly translates expressions into methods until all expressions have been translated. The compiler performs these translations in a specified order, which is documented in the C# specification, although I’m not explaining them in that order. That order is convenient for the compiler; I chose an order that is easier to explain to humans, and I illustrate some of the translations with smaller, simpler examples.
In the following query, let’s examine the where, select, and range variables:
int[] numbers = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var smallNumbers = from n in numbers
                   where n < 5
                   select n;
The expression from n in numbers binds the range variable n to each value in numbers. The where clause defines a filter that will be translated into a Where method. The expression where n < 5 translates to the following:
numbers.Where((n) => n < 5);
Where is nothing more than a filter. The output of Where is a subset of the input sequence containing only those elements that satisfy the predicate. The input and output sequences contain elements of the same type, and a correct Where method must not modify the items in the input sequence. (User-defined predicates may modify items, but that’s not the responsibility of the query expression pattern.)
That Where method can be implemented either as an instance method accessible to numbers or as an extension method matching the type of numbers. In the example, numbers is an array of int. Therefore, n in the method call must be an int.
Where is the simplest of the translations from query expression to method call. Before we go on, let’s dig a little deeper into how this works and what that means for the translations. The compiler completes its translation from query expression to method call before any overload resolution or type binding. The compiler does not know whether there are any candidate methods when the compiler translates the query expression to a method call. It doesn’t examine the type, and it doesn’t look for any candidate extension methods. It simply translates the query expression into the method call. After all queries have been translated into method call syntax, the compiler performs the work of searching for candidate methods and then determining the best match.
Next, you can extend that simple example to include the select expression in the query. Select clauses are translated into Select methods. However, in certain special cases the Select method can be optimized away. The sample query is a degenerate select: It simply selects the range variable. A degenerate select can be optimized away only when the resulting sequence is already guaranteed not to be the source sequence itself. The sample query has a where clause, which breaks that identity relationship between the input sequence and the output sequence, so the select is redundant. Therefore, the final method call version of the query is this:
var smallNumbers = numbers.Where(n => n < 5);
The select clause is removed because it is redundant. That’s safe because the select operates on an immediate result from another query expression (in this example, where).
When the select does not operate on the immediate result of another expression, it cannot be optimized away. Consider this query:
var allNumbers = from n in numbers select n;
It will be translated into this method call:
var allNumbers = numbers.Select(n => n);
While we’re on this subject, note that select is often used to transform or project one input element into a different element or into a different type. The following query modifies the value of the result:
int[] numbers = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var smallNumbers = from n in numbers
                   where n < 5
                   select n * n;
Or you could transform the input sequence into a different type as follows:
int[] numbers = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var squares = from n in numbers
              select new { Number = n, Square = n * n };
The select clause maps to a Select method that matches the signature in the query expression pattern:
var squares = numbers.Select(n =>
    new { Number = n, Square = n * n });
Select transforms the input type into the output type. A proper select method must produce exactly one output element for each input element. Also, a proper implementation of Select must not modify the items in the input sequence.
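If you supply your own Select, the same rules apply. Here is one possible implementation in the spirit of the reference version; it’s a sketch, not the actual library code:
using System;
using System.Collections.Generic;

public static class SelectSketch
{
    public static IEnumerable<TResult> Select<TSource, TResult>(
        this IEnumerable<TSource> source,
        Func<TSource, TResult> selector)
    {
        // Exactly one output element per input element; the input
        // items themselves are never modified.
        foreach (TSource item in source)
            yield return selector(item);
    }
}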
That’s the end of the simpler query expressions. Now we discuss some of the less obvious transformations.
Ordering relations map to the OrderBy and ThenBy methods, or OrderByDescending and ThenByDescending. Consider this query:
var people = from e in employees
             where e.Age > 30
             orderby e.LastName, e.FirstName, e.Age
             select e;
It translates into this:
var people = employees.Where(e => e.Age > 30).
    OrderBy(e => e.LastName).
    ThenBy(e => e.FirstName).
    ThenBy(e => e.Age);
Notice in the definition of the query expression pattern that ThenBy operates on a sequence returned by OrderBy or ThenBy. Those sequences can contain markers that enable ThenBy to operate on the sorted subranges when the sort keys are equal.
This transformation is not the same if the orderby clauses are expressed as different clauses. The following query sorts the sequence entirely by LastName, then sorts the entire sequence again by FirstName, and then sorts again by Age:
// Not correct. Sorts the entire sequence three times.
var people = from e in employees
             where e.Age > 30
             orderby e.LastName
             orderby e.FirstName
             orderby e.Age
             select e;
Within a single orderby clause, you can mark any of the sort keys as descending; the keys after the first still translate into ThenBy or ThenByDescending calls:
var people = from e in employees
             where e.Age > 30
             orderby e.LastName descending, e.FirstName, e.Age
             select e;
The OrderBy method creates a different sequence type as its output so that ThenBy can be more efficient and so that the types are correct for the overall query. ThenBy cannot operate on an unordered sequence; it operates only on a sorted sequence (typed as O<T> in the sample) in which the subranges are already sorted and marked. If you create your own OrderBy and ThenBy methods for a type, you must adhere to this rule. You’ll need to mark each sorted subrange so that any subsequent ThenBy can work properly. ThenBy methods need to be typed to take the output of an OrderBy or ThenBy method and then sort each subrange correctly.
Everything I’ve said about OrderBy and ThenBy also applies to OrderByDescending and ThenByDescending. In fact, if your type has a custom version of any of those methods, you should almost always implement all four of them.
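The following sketch shows the idea, though it is not how System.Linq implements it (for one thing, this sort is not stable): the type returned by OrderBy carries its comparison with it, and ThenBy composes a new comparison that consults the secondary key only when the primary keys tie.
using System;
using System.Collections;
using System.Collections.Generic;

public class OrderedSequence<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> source;
    private readonly Comparison<T> comparison;

    public OrderedSequence(IEnumerable<T> source, Comparison<T> comparison)
    {
        this.source = source;
        this.comparison = comparison;
    }

    // Compose the orderings: the existing comparison decides first,
    // and the new key is consulted only to break ties.
    public OrderedSequence<T> ThenBy<TKey>(Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        Comparison<T> primary = comparison;
        return new OrderedSequence<T>(source, (left, right) =>
        {
            int result = primary(left, right);
            return result != 0 ? result :
                keySelector(left).CompareTo(keySelector(right));
        });
    }

    public IEnumerator<T> GetEnumerator()
    {
        // Sorting happens only when the sequence is enumerated.
        List<T> items = new List<T>(source);
        items.Sort(comparison);
        return items.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public static class OrderingSketch
{
    public static OrderedSequence<T> OrderBy<T, TKey>(
        this IEnumerable<T> source, Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        return new OrderedSequence<T>(source, (left, right) =>
            keySelector(left).CompareTo(keySelector(right)));
    }
}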
The remaining expression translations involve multiple steps. Those queries involve either groupings or multiple from clauses that introduce continuations. Query expressions that contain continuations are translated into nested queries. Then those nested queries are translated into methods. Following is a simple query with a continuation:
var results = from e in employees
              group e by e.Department into d
              select new { Department = d.Key,
                           Size = d.Count() };
Before any other translations are performed, the continuation is translated into a nested query:
var results = from d in
                  from e in employees group e by e.Department
              select new { Department = d.Key, Size = d.Count() };
Once the nested query is created, the methods translate into the following:
var results = employees.GroupBy(e => e.Department).
    Select(d => new { Department = d.Key, Size = d.Count() });
The foregoing query shows a GroupBy that returns a single sequence. The other GroupBy method in the query expression pattern returns a sequence of groups in which each group contains a key and a list of values:
var results = from e in employees
              group e by e.Department into d
              select new { Department = d.Key,
                           Employees = d.AsEnumerable() };
That query maps to the following method calls:
var results2 = employees.GroupBy(e => e.Department).
    Select(d => new { Department = d.Key,
                      Employees = d.AsEnumerable() });
GroupBy methods produce a sequence of key/value list pairs; the keys are the group selectors, and the values are the sequence of items in the group. The query select clause may create new objects for the values in each group. However, the output should always be a sequence of key/value pairs in which the value contains some element created by each item in the input sequence that belongs to that particular group.
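A greatly simplified sketch of that behavior follows. It is not the library implementation: the real GroupBy defers execution and returns IGrouping<K, T> instances, whereas this version builds all of the groups eagerly. It does show the essential contract: every input element lands in exactly one group, keyed by the selector.
using System;
using System.Collections;
using System.Collections.Generic;

public class Grouping<TKey, TElement> : IEnumerable<TElement>
{
    private readonly List<TElement> elements = new List<TElement>();

    public Grouping(TKey key)
    {
        Key = key;
    }

    public TKey Key { get; private set; }

    internal void Add(TElement element)
    {
        elements.Add(element);
    }

    public IEnumerator<TElement> GetEnumerator()
    {
        return elements.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

public static class GroupingSketch
{
    public static IEnumerable<Grouping<TKey, TSource>> GroupBy<TSource, TKey>(
        this IEnumerable<TSource> source,
        Func<TSource, TKey> keySelector)
    {
        // Groups appear in the order their keys are first seen.
        var groupsByKey = new Dictionary<TKey, Grouping<TKey, TSource>>();
        var orderedGroups = new List<Grouping<TKey, TSource>>();

        foreach (TSource item in source)
        {
            TKey key = keySelector(item);
            Grouping<TKey, TSource> group;
            if (!groupsByKey.TryGetValue(key, out group))
            {
                group = new Grouping<TKey, TSource>(key);
                groupsByKey.Add(key, group);
                orderedGroups.Add(group);
            }
            group.Add(item);
        }
        return orderedGroups;
    }
}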
The final methods to understand are SelectMany, Join, and GroupJoin. These three methods are complicated, because they work with multiple input sequences. The methods that implement these translations perform the enumerations across multiple sequences and then flatten the resulting sequences into a single output sequence. SelectMany performs a cross join on the two source sequences. For example, consider this query:
int[] odds = { 1, 3, 5, 7 };
int[] evens = { 2, 4, 6, 8 };
var pairs = from oddNumber in odds
            from evenNumber in evens
            select new { oddNumber, evenNumber,
                         Sum = oddNumber + evenNumber };
It produces a sequence having 16 elements:
1,2, 3
1,4, 5
1,6, 7
1,8, 9
3,2, 5
3,4, 7
3,6, 9
3,8, 11
5,2, 7
5,4, 9
5,6, 11
5,8, 13
7,2, 9
7,4, 11
7,6, 13
7,8, 15
Query expressions that contain multiple from clauses are translated into SelectMany method calls. The sample query would be translated into the following SelectMany call:
int[] odds = { 1, 3, 5, 7 };
int[] evens = { 2, 4, 6, 8 };
var values = odds.SelectMany(oddNumber => evens,
    (oddNumber, evenNumber) =>
        new { oddNumber, evenNumber,
              Sum = oddNumber + evenNumber });
The first parameter to SelectMany is a function that maps each element in the first source sequence to the sequence of elements in the second source sequence. The second parameter (the output selector) creates the projections from the pairs of items in both sequences.
SelectMany() iterates the first sequence. For each value in the first sequence, it iterates the second sequence, producing the result value from the pair of input values. The output selector is called for each element of the flattened sequence of every combination of values from both sequences. One possible implementation of SelectMany is as follows:
static IEnumerable<TOutput> SelectMany<T1, T2, TOutput>(
    this IEnumerable<T1> src,
    Func<T1, IEnumerable<T2>> inputSelector,
    Func<T1, T2, TOutput> resultSelector)
{
    foreach (T1 first in src)
    {
        foreach (T2 second in inputSelector(first))
            yield return resultSelector(first, second);
    }
}
The first input sequence is iterated. Then the second input sequence is iterated using the current value on the input sequence. That’s important, because the input selector on the second sequence may depend on the current value in the first sequence. Then, as each pair of elements is generated, the result selector is called on each pair.
If your query has more expressions and if SelectMany does not create the final result, then SelectMany creates a pair (implemented as an anonymous type) that contains one item from each input sequence. The sequence of those pairs becomes the input sequence for the later expressions. For example, consider this modified version of the original query:
int[] odds = { 1, 3, 5, 7 };
int[] evens = { 2, 4, 6, 8 };
var values = from oddNumber in odds
             from evenNumber in evens
             where oddNumber > evenNumber
             select new { oddNumber, evenNumber,
                          Sum = oddNumber + evenNumber };
It produces this SelectMany method call:
odds.SelectMany(oddNumber => evens,
    (oddNumber, evenNumber) =>
        new { oddNumber, evenNumber });
The full query is then translated into this statement:
var values = odds.SelectMany(oddNumber => evens,
        (oddNumber, evenNumber) =>
            new { oddNumber, evenNumber }).
    Where(pair => pair.oddNumber > pair.evenNumber).
    Select(pair => new {
        pair.oddNumber,
        pair.evenNumber,
        Sum = pair.oddNumber + pair.evenNumber });
You can see another interesting property in the way SelectMany gets treated when the compiler translates multiple from clauses into SelectMany method calls. SelectMany composes well. More than two from clauses will produce more than one SelectMany() method call. The resulting pair from the first SelectMany() call will be fed into the second SelectMany(), which will produce a triple. The triple will contain all combinations of all three sequences. Consider this query:
var triples = from n in new int[] { 1, 2, 3 }
              from s in new string[] { "one", "two", "three" }
              from r in new string[] { "I", "II", "III" }
              select new { Arabic = n, Word = s, Roman = r };
It will be translated into the following method calls:
var numbers = new int[] { 1, 2, 3 };
var words = new string[] { "one", "two", "three" };
var romanNumerals = new string[] { "I", "II", "III" };
var triples = numbers.SelectMany(n => words,
        (n, s) => new { n, s }).
    SelectMany(pair => romanNumerals,
        (pair, n) =>
            new { Arabic = pair.n, Word = pair.s, Roman = n });
As you can see, you can extend from three to any arbitrary number of input sequences by applying more SelectMany() calls. These later examples also demonstrate how SelectMany can introduce anonymous types into your queries. The sequence returned from SelectMany() is a sequence of some anonymous type.
Now let’s look at the two other translations you need to understand: Join and GroupJoin. Both are applied on join expressions. GroupJoin is always used when the join expression contains an into clause. Join is used when the join expression does not contain an into clause.
A join without an into looks like this:
var numbers = new int[] { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
var labels = new string[] { "0", "1", "2", "3", "4", "5" };
var query = from num in numbers
            join label in labels on num.ToString() equals label
            select new { num, label };
It translates into the following:
var query = numbers.Join(labels, num => num.ToString(),
    label => label, (num, label) => new { num, label });
The into clause creates a list of subdivided results:
var groups = from p in projects
             join t in tasks on p equals t.Parent into projTasks
             select new { Project = p, projTasks };
That translates into a GroupJoin:
var groups = projects.GroupJoin(tasks,
    p => p, t => t.Parent, (p, projTasks) =>
        new { Project = p, TaskList = projTasks });
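To see what those methods must do, here is a rough sketch of Join’s behavior (again, not the library code): index the inner sequence by key, then pair each outer element with every matching inner element. GroupJoin differs only in that its result selector receives each outer element together with its entire, possibly empty, collection of matches.
using System;
using System.Collections.Generic;

public static class JoinSketch
{
    public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
        this IEnumerable<TOuter> outer,
        IEnumerable<TInner> inner,
        Func<TOuter, TKey> outerKeySelector,
        Func<TInner, TKey> innerKeySelector,
        Func<TOuter, TInner, TResult> resultSelector)
    {
        // Index the inner sequence by key.
        var innerByKey = new Dictionary<TKey, List<TInner>>();
        foreach (TInner innerItem in inner)
        {
            TKey key = innerKeySelector(innerItem);
            List<TInner> matches;
            if (!innerByKey.TryGetValue(key, out matches))
            {
                matches = new List<TInner>();
                innerByKey.Add(key, matches);
            }
            matches.Add(innerItem);
        }

        // Pair each outer element with every matching inner element.
        // Outer elements with no matches produce nothing.
        foreach (TOuter outerItem in outer)
        {
            List<TInner> matches;
            if (innerByKey.TryGetValue(outerKeySelector(outerItem), out matches))
                foreach (TInner innerItem in matches)
                    yield return resultSelector(outerItem, innerItem);
        }
    }
}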
The entire process of converting all expressions into method calls is complicated and often takes several steps.
The good news is that for the most part, you can happily go about your work secure in the knowledge that the compiler does the correct translation. And because your type implements IEnumerable<T>, users of your type are getting the correct behavior.
But you may have that nagging urge to create your own version of one or more of the methods that implement the query expression pattern. Maybe your collection type is always sorted on a certain key, and you can short-circuit the OrderBy method. Maybe your type exposes lists of lists, which means that GroupBy and GroupJoin can be implemented more efficiently.
More ambitiously, maybe you intend to create your own provider and you’ll implement the entire pattern. That being the case, you need to understand the behavior of each query method and know what should go into your implementation. Refer to the examples, and make sure you understand the expected behavior of each query method before you embark on creating your own implementations.
Many of the custom types you define model some kind of collection. The developers who use your types will expect to use your collections in the same way that they use every other collection type, with the built-in query syntax. As long as you support the IEnumerable<T> interface for any type that models a collection, you’ll meet that expectation. However, your types may be able to improve on the default implementation by using the internal specifics in your type. When you choose to do that, ensure that your type matches the contract from the query pattern in all forms.
Item 37: Prefer Lazy Evaluation Queries
When you define a query, you don’t actually get the data and populate a sequence. You are actually defining only the set of steps that you will execute when you choose to iterate that query. This means that each time you execute a query, you perform the entire recipe from first principles. That’s usually the right behavior. Each new enumeration produces new results, in what is called lazy evaluation. However, sometimes that’s not what you want. When you grab a set of values, you often want to retrieve them once and retrieve them now, in what is called eager evaluation.
Every time you write a query that you plan to enumerate more than once, you need to consider which behavior you want. Do you want a snapshot of your data, or do you want to create a description of the code you will execute in order to create the sequence of values?
This concept is a major change in the way you are likely accustomed to working. You probably view code as something that is executed immediately. However, with LINQ queries, you’re injecting code into a method. That code will be invoked at a later time. More than that, if the provider uses expression trees instead of delegates, those expression trees can be combined later by combining new expressions into the same expression tree.
Let’s start with an example to explain the difference between lazy and eager evaluation. The following bit of code generates a sequence and then iterates that sequence twice, with a pause between iterations.
private static IEnumerable<TResult>
    Generate<TResult>(int number, Func<TResult> generator)
{
    for (int i = 0; i < number; i++)
        yield return generator();
}

private static void LazyEvaluation()
{
    Console.WriteLine("Start time for Test One: {0}",
        DateTime.Now);
    var sequence = Generate(10, () => DateTime.Now);

    Console.WriteLine("Waiting....\tPress Return");
    Console.ReadLine();
    Console.WriteLine("Iterating...");
    foreach (var value in sequence)
        Console.WriteLine(value);

    Console.WriteLine("Waiting....\tPress Return");
    Console.ReadLine();
    Console.WriteLine("Iterating...");
    foreach (var value in sequence)
        Console.WriteLine(value);
}
Here’s one sample output:
Start time for Test One: 11/18/2007 6:43:23 PM
Waiting.... Press Return
Iterating...
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
11/18/2007 6:43:31 PM
Waiting.... Press Return
Iterating...
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
11/18/2007 6:43:42 PM
In this example of lazy evaluation, notice that the sequence is generated each time it is iterated, as evidenced by the different time stamps. The sequence variable does not hold the elements created. Rather, it holds the definition of how to create the sequence; the generator runs again on each enumeration. You should run this code yourself, stepping through it to see exactly when each expression is evaluated. It’s the most instructive way to learn how LINQ queries are evaluated.
You can use this capability to compose queries from existing queries. Instead of retrieving the results from the first query and processing them as a separate step, you can compose queries in different steps and then execute the composed query only once. For example, suppose I modify the query to return times in universal format:
var sequence1 = Generate(10, () => DateTime.Now);
var sequence2 = from value in sequence1
                select value.ToUniversalTime();
Sequence 1 and sequence 2 share functional composition, not data. Sequence 2 is not built by enumerating the values in sequence 1 and modifying each value. Rather, it is created by executing the code that produces sequence 1, followed by the code that produces sequence 2. If you iterate the two sequences at different times, you’ll see unrelated sequences. Sequence 2 will not contain the converted values from sequence 1. Instead, it will contain totally new values. The query doesn’t generate a sequence of dates and then convert the entire sequence into universal time. Instead, each time it is enumerated, it generates each value and converts that value to universal time on the fly.
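You can watch that composition happen by enumerating sequence2 twice with a pause in between; every value printed by the second loop is later than every value printed by the first:
foreach (var value in sequence2)
    Console.WriteLine(value);     // generated now, already in universal time

System.Threading.Thread.Sleep(5000);

foreach (var value in sequence2)
    Console.WriteLine(value);     // a second, later set of values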
Query expressions may operate on infinite sequences. They can do so because they are lazy. If written correctly, they examine the first portion of the sequence and then terminate when an answer is found. On the other hand, some query expressions must retrieve the entire sequence before they can proceed to create their answer. Understanding when these bottlenecks might occur will help you create queries that are natural without incurring performance penalties. In addition, this understanding will help you avoid those times when the full sequence is required and will create a bottleneck.
Consider this small program:
static void Main(string[] args)
{
    var answers = from number in AllNumbers()
                  select number;
    var smallNumbers = answers.Take(10);
    foreach (var num in smallNumbers)
        Console.WriteLine(num);
}

static IEnumerable<int> AllNumbers()
{
    int number = 0;
    while (number < int.MaxValue)
    {
        yield return number++;
    }
}
This sample illustrates what I mean about a method that does not need the full sequence. The output from this method is the sequence of numbers 0,1,2,3,4,5,6,7,8,9. That’s the case even though the AllNumbers() method could generate an infinite sequence. (Yes, it eventually has an overflow, but you’ll lose patience long before then.)
The reason this works as quickly as it does is that the entire sequence is not needed. The Take() method returns the first N objects from the sequence, so nothing else matters.
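Take can behave that way because it never asks the source for more elements than it needs. One possible implementation, shown here only as a sketch, looks like this:
using System.Collections.Generic;

public static class SequenceParts
{
    public static IEnumerable<T> Take<T>(
        this IEnumerable<T> source, int count)
    {
        if (count <= 0)
            yield break;

        int taken = 0;
        foreach (T item in source)
        {
            yield return item;
            if (++taken == count)
                yield break;    // Stop pulling from the source here.
        }
    }
}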
However, if you rewrite this query as follows, your program will run forever:
class Program
{
    static void Main(string[] args)
    {
        var answers = from number in AllNumbers()
                      where number < 10
                      select number;

        foreach (var num in answers)
            Console.WriteLine(num);
    }
}
It runs forever because the query must examine every single number to determine which values satisfy the filter. This version of the same logic requires the entire sequence.
A number of query operators cannot finish until they have examined the entire sequence. Where must look at every element before it knows it has produced its last match (although it streams matches as it finds them). OrderBy needs the entire sequence to be present before it can return its first element. Max and Min must examine every element. There’s no way to perform these operations without examining the whole sequence. When you need these capabilities, you’ll use these methods.
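Max illustrates why: any element that hasn’t been examined yet could still be the largest, so the loop cannot stop early. A sketch of that logic:
using System;
using System.Collections.Generic;

public static class SequenceMath
{
    public static int Max(this IEnumerable<int> source)
    {
        bool foundAny = false;
        int best = int.MinValue;
        foreach (int value in source)
        {
            // Every element must be examined; on an infinite
            // sequence this loop never finishes.
            foundAny = true;
            if (value > best)
                best = value;
        }
        if (!foundAny)
            throw new InvalidOperationException(
                "Sequence contains no elements");
        return best;
    }
}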
You need to think about the consequences of using methods that require access to the entire sequence. As you’ve seen, you need to avoid any methods that require the entire sequence if the sequence might be infinite. Second, even if the sequence is not infinite, any query methods that filter the sequence should be front-loaded in the query. If the first steps in your query remove some of the elements from the collection, that will have a positive effect on the performance of the rest of the query.
For example, the following two queries produce the same result. However, the second query may execute faster. A sophisticated provider might optimize the first query so that both perform the same. However, in the LINQ to Objects implementation (provided by System.Linq.Enumerable), the first query reads and sorts all the products and only then filters the sequence.
// Order before filter.
var sortedProductsSlow =
    from p in products
    orderby p.UnitsInStock descending
    where p.UnitsInStock > 100
    select p;

// Filter before order.
var sortedProductsFast =
    from p in products
    where p.UnitsInStock > 100
    orderby p.UnitsInStock descending
    select p;
Notice that the first query sorts the entire sequence and then throws away every product that doesn’t have more than 100 units in stock. The second query filters the sequence first, resulting in a sort on what may be a much smaller sequence. At times, knowing whether the full sequence is needed for a method is the difference between an algorithm that never finishes and one that finishes quickly. You need to understand which methods require the full sequence, and try to execute those last in your query expression.
So far, I’ve given you quite a few reasons to use lazy evaluation in your queries. In most cases, that’s the best approach. At other times, though, you do want a snapshot of the values taken at a point in time. There are two methods you can use to generate the sequence immediately and store the results in a container: ToList() and ToArray(). Both methods perform the query and store the results in a List<T> or an Array, respectively.
These methods are useful for a couple of purposes. By forcing the query to execute immediately, these methods capture a snapshot of the data right now. You force the execution to happen immediately, rather than later when you decide to enumerate the sequence. Also, you can use ToList() or ToArray() to generate a snapshot of query results that is not likely to change before you need it again. You can cache the results and use the saved version later.
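For example, you could capture the timestamps produced by the earlier Generate query as a snapshot and reuse that stored list:
// The generator runs exactly once, right here.
List<DateTime> snapshot = Generate(10, () => DateTime.Now).ToList();

// Both loops print the same ten values, no matter how much time passes
// between them, because you enumerate the stored list, not the query.
foreach (var value in snapshot)
    Console.WriteLine(value);
Console.ReadLine();
foreach (var value in snapshot)
    Console.WriteLine(value);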
In almost all cases, lazy evaluation saves work and is more versatile than eager evaluation. In the rare cases when you do need eager evaluation, you can force it by running the query and storing the sequence results using ToList() or ToArray(). But unless there is a clear need to use eager evaluation, it’s better to use lazy evaluation.
Item 38: Prefer Lambda Expressions to Methods
This recommendation may appear counterintuitive. Coding with lambda expressions can lead to repeated code in the body of lambdas. You often find yourself repeating small bits of logic. The following code snippet has the same logic repeated several times:
var allEmployees = FindAllEmployees();

// Find the first employees:
var earlyFolks = from e in allEmployees
                 where e.Classification == EmployeeType.Salary
                 where e.YearsOfService > 20
                 where e.MonthlySalary < 4000
                 select e;

// find the newest people:
var newest = from e in allEmployees
             where e.Classification == EmployeeType.Salary
             where e.YearsOfService < 2
             where e.MonthlySalary < 4000
             select e;
You could replace the multiple calls to Where with a single Where clause that has both conditions. There isn’t any noticeable difference between the two representations. Because queries compose (see Item 17, Chapter 3) and because simple where predicates will likely be inlined, the performance will be the same.
You may be tempted to factor repeated lambda expressions into methods that can be reused. You’d end up with code that looks like this:
// factor out method:
private static bool LowPaidSalaried(Employee e)
{
    return e.MonthlySalary < 4000 &&
        e.Classification == EmployeeType.Salary;
}

// elsewhere
var allEmployees = FindAllEmployees();
var earlyFolks = from e in allEmployees
                 where LowPaidSalaried(e) &&
                       e.YearsOfService > 20
                 select e;

// find the newest people:
var newest = from e in allEmployees
             where LowPaidSalaried(e) && e.YearsOfService < 2
             select e;
It’s a small example, so there’s not much change here. But already it feels better. Now if the employee classifications change or if the low threshold changes, you’re changing the logic in only one location.
If you’re like most developers, you see code that has been copied as pure evil, something to be eradicated at all costs. The version with a single method is simpler. It has only one copy of the logic to be modified later if needs change. It’s just plain good software engineering.
Unfortunately, it’s also wrong here. This way of refactoring makes the code less reusable: The first version, as written, is actually more reusable than the second, because of the way lambda expressions are evaluated, parsed, and eventually executed. Some code will convert the lambda expressions into a delegate to execute the code in your query expression. Other classes will create an expression tree from the lambda expression, parse that expression, and execute it in another environment. LINQ to Objects does the former, and LINQ to SQL does the latter.
LINQ to Objects performs queries on local data stores, usually stored in a generic collection. The implementation creates an anonymous delegate that contains the logic in the lambda expression and executes that code. The LINQ to Objects extension methods use IEnumerable<T> as the input sequence.
LINQ to SQL, on the other hand, uses the expression tree contained in the query. That expression tree contains the logical representation of your query. LINQ to SQL parses the tree and uses the expression tree to create the proper T-SQL query, which can be executed directly against the database. Then, the query string (as T-SQL) is sent to the database engine and is executed there.
This processing requires that the LINQ to SQL engine parse the expression tree and replace every logical operation with equivalent SQL. All method calls are replaced with an Expression.MethodCall node. The LINQ to SQL engine cannot translate any arbitrary method call into a SQL expression. Instead, it throws an exception. The LINQ to SQL engine fails rather than try to execute multiple queries, bring multiple data to the client side of the application boundary, and then process it there.
If you are building any kind of reusable library for which the data source could be anything, you must anticipate this situation. You must structure the code so that it will work correctly with any data source. This means that you need to keep lambda expressions separate, and as inline code, for your library to function correctly.
Of course, this doesn’t mean that you should be copying code all over the library. It means only that you need to create different building blocks for your applications when query expressions and lambdas are involved. From our simple example, you can create larger reusable blocks this way:
private static IQueryable<Employee> LowPaidSalariedFilter(
    this IQueryable<Employee> sequence)
{
    return from s in sequence
           where s.Classification == EmployeeType.Salary &&
                 s.MonthlySalary < 4000
           select s;
}
// elsewhere:
var allEmployees = FindAllEmployees();
// Find the first employees:
var salaried = allEmployees.LowPaidSalariedFilter();
var earlyFolks = salaried.Where(e => e.YearsOfService > 20);
// find the newest people:
var newest = salaried.Where(e => e.YearsOfService < 2);
Of course, not every query is that simple to update. You need to move up the call chain a bit to find the reusable list-processing logic so that you need to express the same lambda expression only once. Recall from Item 17 (Chapter 3) that enumerator methods do not execute until you begin to traverse the items in the collection. Remembering that fact, you can create small methods that construct each portion of your query and contain commonly used lambda expressions. Each of those methods must take as input the sequence, and must return the sequence using the yield return keyword.
Following that same pattern, you can compose IQueryable enumerators by building new expression trees that can be executed remotely. Here, the expression tree for finding sets of employees can be composed as a query before it is executed. The IQueryProvider object (such as the LINQ to SQL data source) processes the full query rather than pull out parts that must be executed locally.
You then put together those small methods to build the larger queries you will use in your application. The advantage of this technique is that you avoid the code-copying issues that we all dislike in the first sample in this item. You also have structured the code so that it creates an expression tree for execution when you have composed your completed query and begin to execute it.
One of the most efficient ways to reuse lambda expressions in complicated queries is to create extension methods for those queries on closed generic types. You can see that the method for finding the lower-paid salaried employees is such a method. It takes a sequence of employees and returns a filtered sequence of employees. In production code, you should create a second overload that uses IEnumerable<Employee> as the parameter type. In that way, you support both the LINQ to SQL style implementations and the LINQ to Objects implementation.
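That second overload might look like this, reusing the same hypothetical Employee type from the example above; the compiler picks between the two based on the compile-time type of the sequence:
private static IEnumerable<Employee> LowPaidSalariedFilter(
    this IEnumerable<Employee> sequence)
{
    // Same predicate as the IQueryable version, but executed as a
    // delegate by LINQ to Objects rather than parsed into an expression tree.
    return from s in sequence
           where s.Classification == EmployeeType.Salary &&
                 s.MonthlySalary < 4000
           select s;
}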
You can build exactly the queries you need by composing the smaller building blocks from those methods that take lambda expressions and are sequence methods. You gain the advantage of creating code that works with IEnumerable<T> and IQueryable<T>. Furthermore, you haven’t broken the possible evaluation of the queryable expression trees.
Item 39: Avoid Throwing Exceptions in Functions and Actions
When you create code that executes over a sequence of values and the code throws an exception somewhere in that sequence processing, you’ll have problems recovering state. You don’t know how many elements were processed, if any. You don’t know what needs to be rolled back. You can’t restore the program state at all.
Consider this snippet of code, which gives everyone a 5 percent raise:
var allEmployees = FindAllEmployees();
allEmployees.ForEach(e => e.MonthlySalary *= 1.05M);
One day, this routine runs and throws an exception. Chances are that the exception was not thrown on the first or last employee. Some employees got raises, but others didn’t. It will be very difficult for your program to recover the previous state. Can you return the data to a consistent state? Once you lose knowledge of program state, you can’t regain it without human examination of all the data.
This problem occurs because the code snippet modifies elements of a sequence in place. It doesn’t follow the strong exception guarantee. In the face of errors, you can’t know what happened and what didn’t.
You fix this situation by guaranteeing that whenever the method does not complete, the observable program state does not change. You can implement this in various ways, each with its own benefits and risks.
Before talking about the risks, let’s examine the reason for concern in a bit more detail. Not every method exhibits this problem. Many methods examine a sequence but do not modify it. The following method examines everyone’s salary and returns the result:
decimal total = allEmployees.Aggregate(0M,
    (sum, emp) => sum + emp.MonthlySalary);
You don’t need to worry about methods like this one, because they do not modify any data in the sequence. In many applications, you’ll find that most of your methods do not modify the sequence. Let’s return again to our first method, giving every employee a 5 percent raise. What actions can you take to rework this method to ensure that the strong exception guarantee is satisfied?
The first and easiest approach is to rework the action so that you can ensure that the action method, expressed earlier in the lambda expression, never throws an exception. In many cases, it is possible to test any failure conditions before modifying each element in the sequence (see Item 25, Chapter 3). You need to define the functions and predicates so that the method’s contract can be satisfied in all cases, even error conditions. This strategy works if doing nothing is the right behavior for elements that caused the exception. In the example of granting raises, imagine that all exceptions are caused by employee records that are stale and include people who no longer work for the company but are still in persistent storage. That would make it correct behavior to skip them. This modification would work:
allEmployees.FindAll(
    e => e.Classification == EmployeeType.Active).
    ForEach(e => e.MonthlySalary *= 1.05M);
Fixing the problem in this way is the simplest path to avoiding inconsistencies in your algorithms. Whenever you can write your action methods to ensure that no exceptions leave a lambda expression or action method, that’s the most efficient technique to use.
However, sometimes you may not be able to guarantee that those expressions never throw an exception. Now you must take more-expensive defensive measures. You need to rework the algorithm to take into account the possibility of an exception. That means doing all the work on a copy and then replacing the original sequence with the copy only if the operation completes successfully. If you felt you could not avoid the possibility of an exception, you could rewrite our earlier algorithm:
var updates = (from e in allEmployees
               select new Employee
               {
                   EmployeeID = e.EmployeeID,
                   Classification = e.Classification,
                   YearsOfService = e.YearsOfService,
                   MonthlySalary = e.MonthlySalary * 1.05M
               }).ToList();
allEmployees = updates;
You can see the cost of those changes here. First, there’s quite a bit more code than in the earlier versions. That’s more work: more code to maintain and more to understand. But you’ve also changed the performance metrics for the application. This newer version creates a second copy of every employee record and then swaps the reference to the new list of employees with the reference to the old list. If the employee list is large, that could cause a big performance bottleneck. The contract has also changed: the query may still throw an exception when an employee object is invalid, but the code outside the query now handles that condition, and the original list is left untouched when it happens.
And there’s still another issue with this particular fix: Whether or not it makes sense depends on how it’s used. This new version limits your ability to compose operations using multiple functions. This code snippet caches the full list. This means that its modifications aren’t composed along with other transformations in a single enumeration of the list. Each transformation becomes an imperative operation. In practice, you can work around this issue by creating one query statement that performs all the transformations. You cache the list and swap the entire sequence as one final step for all the transformations. Using that technique, you preserve the composability and still provide the strong exception guarantee.
In practice, that means writing query expressions that return a new sequence rather than modifying each element of a sequence in place. The composed query then swaps in the new list only if no exceptions were generated while processing any of the steps in the sequence.
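In our example, that might mean composing the filter for active employees shown earlier in this item with the raise itself, and treating the ToList() call and the assignment as the single committing step (still using the hypothetical Employee type):
var updates = (from e in allEmployees
               where e.Classification == EmployeeType.Active
               select new Employee
               {
                   EmployeeID = e.EmployeeID,
                   Classification = e.Classification,
                   YearsOfService = e.YearsOfService,
                   MonthlySalary = e.MonthlySalary * 1.05M
               }).ToList();   // Any exception is thrown here, before the swap.

// The single observable change. Nothing before this line touched
// the original list.
allEmployees = updates;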
Composing queries changes the way you write exception-safe code. If your actions or functions throw an exception, you may have no way to ensure that the data is not in an inconsistent state. You don’t know how many elements were processed. You don’t know what actions must be taken to restore the original state. However, returning new elements (rather than modifying the elements in place) gives you a better chance of ensuring that operations either complete or don’t modify any program state.
The same advice applies to any method that mutates data when exceptions may be thrown, and it also applies in multithreaded environments. The problem is just harder to spot when the code that may throw the exception lives inside a lambda expression. As the final operation, you should swap the entire sequence only after you are sure that none of the steps has generated an exception.
Item 40: Distinguish Early from Deferred Execution
Declarative code is expository: It defines what gets done. Imperative code details step-by-step instructions that explain how something gets done. Both are valid and can be used to create working programs. However, mixing the two causes unpredictable behavior in your application.
Most of the imperative code you write calculates any needed parameters and then calls the method. This line of code describes an imperative set of steps to create the answer:
object answer = DoStuff(Method1(),
    Method2(),
    Method3());
At runtime, this line of code does the following.
- It calls Method1 to generate the first parameter to DoStuff().
- It calls Method2 to generate the second parameter to DoStuff().
- It calls Method3 to generate the third parameter to DoStuff().
- It calls DoStuff with the three calculated parameters.
That should be a familiar style of code for you. All parameters are calculated, and the data is sent to the method. The algorithms you write are a descriptive set of steps that must be followed to produce the results.
Deferred execution, in which you use lambdas and query expressions, completely changes this process and may pull the rug out from under you. The following line of code seems to do the same thing as the foregoing example, but you’ll soon see that there are important differences:
object answer = DoStuff(() => Method1(),
    () => Method2(),
    () => Method3());
At runtime, this line of code does the following.
- It calls DoStuff(), passing the lambda expressions that could call Method1, Method2, and Method3.
- Inside DoStuff, if and only if the result of Method1 is needed, Method1 is called.
- Inside DoStuff, if and only if the result of Method2 is needed, Method2 is called.
- Inside DoStuff, if and only if the result of Method3 is needed, Method3 is called.
- Method1, Method2, and Method3 may be called in any order, as many times (including zero) as needed.
None of those methods will be called unless the results are needed. This difference is significant, and you will cause yourself major problems if you mix the two idioms.
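For example, a hypothetical DoStuff written for the deferred form decides for itself which delegates to invoke, so Method2 and Method3 might never run at all:
private static object DoStuff(Func<object> first,
    Func<object> second,
    Func<object> third)
{
    object result = first();      // The first delegate always runs here.
    if (result != null)
        return result;            // second and third are never invoked.

    // Only on this path do the other delegates execute.
    return second() ?? third();
}
Whether that difference matters depends entirely on whether those methods have side effects, which is the subject of the rest of this item.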
From the outside, any method can be replaced by its return value, and vice versa, as long as that method does not produce any side effects. In our example, the DoStuff() method does not see any difference between the two strategies. The same value is returned, and either strategy is correct. If the method always returns the same value for the same inputs, then the method return value can always be replaced by a call to the method, and vice versa.
However, looking at the program as a whole, there may be significant differences between the two lines of code. The imperative model always calls all three methods. Any side effects from any of those methods always occur exactly once. In contrast, the declarative model may or may not execute all or any of the methods. The declarative version may execute any of the methods more than once. This is the difference between (1) calling a method and passing the results to a method and (2) passing a delegate to the method and letting the method call the delegate. You may get different results from different runs of the application, depending on what actions take place in these methods.
The addition of lambda expressions, type inference, and enumerators makes it much easier to use functional programming concepts in your classes. You can build higher-order functions that take functions as parameters or that return functions to their callers. In one way, this is not a big change: A true function and its return value are always interchangeable. In practice, a function may have side effects, and this means that different rules apply.
If data and methods are interchangeable, which should you choose? And, more importantly, when should you choose which? The most important difference is that data must be preevaluated, whereas a method can be lazy-evaluated. When you must evaluate data early, you must preevaluate the method and use the result as the data, rather than take a functional approach and substitute the method.
The most important criterion for deciding which to use is the possibility of side effects, both in the body of the function and in the mutability of its return value. Item 37 (earlier in this chapter) shows a query whose results are based on the current time. Its return value changes depending on whether you execute it and cache the results or you use the query as a function parameter. If the function itself produces side effects, the behavior of the program depends on when you execute the function.
There are techniques you can use to minimize the contrast between early and late evaluation. Pure immutable types cannot be changed, and they don’t change other program states; therefore, they are not subject to side effects. In the brief example earlier, if Method1, Method2, and Method3 are members of an immutable type, then the observable behavior of the early and the late evaluation statements should be exactly the same.
My example does not take any parameters, but if any of those late evaluation methods took parameters, those parameters would need to be immutable to ensure that the early and late binding results were the same.
Therefore, the most important point in deciding between early and late evaluation is the semantics that you want to achieve. If (and only if) the objects and methods are immutable, then the correctness of the program is the same when you replace a value with the function that calculates it, and vice versa. (“Immutable methods” in this case means that the methods cannot modify any global state, such as performing I/O operations, updating global variables, or communicating with other processes.) If the objects and methods are not immutable, you risk changing the program’s behavior by changing from early to late evaluation and vice versa. The rest of this item assumes that the observable behavior won’t change between early and late evaluation. We look at other reasons to favor one or the other strategy.
One decision point is the size of the input and output space versus the cost of computing the output. For example, programs would still work if Math.PI calculated pi when called. The value and the computation are interchangeable from the outside. However, programs would be slower because calculating pi takes time. On the other hand, a method CalculatePrimeFactors(int) could be replaced with a lookup table containing all factors of all integers. In that case, the cost of the data table in memory would likely be much greater than the cost in time of calculating the values when needed.
Your real-world problems probably fall somewhere between those two extremes. The right solution won’t be as obvious, nor will it be as clear-cut. In addition to analyzing the computational cost versus the storage cost, you need to consider how you will use the results of any given method. You will find that in some situations, early evaluation of certain queries will make sense. In other cases, you’ll use interim results only infrequently. If you ensure that the code does not produce side effects and that either early or deferred evaluation produces the correct answer, then you can make the decision based on the measured performance metrics of both solutions. You can try both ways, measure the difference, and use the best result.
Finally, in some cases, you may find that a mixture of the two strategies will work the best. You may find that caching sometimes provides the most efficiency. In those cases, you can create a delegate that returns the cached value:
MyType cache = Method1();
object answer = DoStuff(() => cache,
    () => Method2(),
    () => Method3());
The final decision point is whether the method can execute on a remote data store. This factor has quite a bearing on how LINQ to SQL processes queries. Every LINQ to SQL query starts as a deferred query: The methods, and not the data, are used as parameters. Some of the methods may involve work that can be done inside the database engine, and some of the work represents local methods that must be processed before the partially processed query is submitted to the database engine. LINQ to SQL parses the expression tree. Before submitting the query to the database engine, it replaces any local method calls with the result from those method calls. It can do this processing only if a method call does not rely on any individual items in the input sequence being processed (see Items 37 and 38, both in this chapter).
Once LINQ to SQL has replaced any local method calls with the equivalent return values, it translates the query from expressions into SQL statements, which are sent to the database engine and executed there. The result is that by creating a query as a set of expressions, or code, the LINQ to SQL libraries can replace those methods with equivalent SQL. That provides improved performance and lower bandwidth usage. It also means that you as a C# developer can spend less time learning T-SQL. Other providers can do the same.
However, all this work is possible only because you can treat data as code, and vice versa, under the right circumstances. With LINQ to SQL, local methods can be replaced with the return values when the parameters to the method are constants that do not rely on the input sequence. Also, there is quite a bit of functionality in the LINQ to SQL libraries that translates expression trees to a logical structure that can then be translated into T-SQL.
As you create algorithms in C# now, you can determine whether using the data as a parameter or the function as a parameter causes any difference in behavior. Once you’ve determined that either would be correct, you must determine which would be the better strategy. When the input space is smaller, passing data might be better. However, in other cases, when the input or output space may be very large and you don’t necessarily use the entire input data space, you may find that it’s much wiser to use the algorithm itself as a parameter. If you’re not sure, lean toward using the algorithm as a parameter, because the developer who implements the function can create that function to eagerly evaluate the output space and work with those data values instead.
Item 41: Avoid Capturing Expensive Resources
Closures create objects that contain bound variables. The lifetimes of those bound variables may surprise you, and not always in a good way. As developers we’ve grown accustomed to looking at the lifetimes of local variables in a very simple way: Variables come into scope when we declare them, and they go out of scope when the corresponding block closes. Local variables are eligible for garbage collection when they go out of scope. We use these assumptions to manage resource usage and object lifetimes.
Closures and captured variables change those rules. When you capture a variable in a closure, the object referenced by that variable does not go out of scope until the last delegate referencing that captured variable goes out of scope. Under some circumstances it may last even longer. After closures and captured variables escape one method, they can be accessed by closures and delegates in client code. Those delegates and closures can be accessed by other code, and so on. Eventually the code accessing your delegate becomes an open-ended set of methods with no idea when your closure and delegates are no longer reachable. The implication is that you really don’t know when local variables go out of scope if you return something that is represented by a delegate using a captured variable.
The good news is that often you don’t need to be concerned about this behavior. Local variables that are managed types and don’t hold on to expensive resources are garbage-collected at a later point, just as regular variables are. If the only thing used by local variables is memory, there’s no concern at all.
But some variables hold on to expensive resources. They represent types that implement IDisposable and need to be explicitly cleaned up. You may prematurely clean up those resources before you’ve actually enumerated the collection. You may find that files or connections aren’t being closed quickly enough, and you’re not able to access files because they are still open.
Item 33 (Chapter 4) shows you how the C# compiler produces delegates and how variables are captured inside a closure. In this item, we look at how to recognize when you have captured variables that contain other resources. We examine how to manage those resources and how to avoid pitfalls that can occur when captured variables live longer than you’d like.
Consider this construct:
int counter = 0;
IEnumerable<int> numbers =
Extensions.Generate(30, () => counter++);
It generates code that looks something like this:
private class Closure
{
    public int generatedCounter;
    public int generatorFunc()
    {
        return generatedCounter++;
    }
}

// usage
Closure c = new Closure();
c.generatedCounter = 0;
IEnumerable<int> sequence = Extensions.Generate(30,
    new Func<int>(c.generatorFunc));
This can get very interesting. The hidden nested class members have been bound to delegates used by Extensions.Generate. That can affect the lifetime of the hidden object and therefore can affect when any of the members are eligible for garbage collection. Look at this example:
public IEnumerable<int> MakeSequence()
{
    int counter = 0;
    IEnumerable<int> numbers = Extensions.Generate(30,
        () => counter++);
    return numbers;
}
In this code, the returned object uses the delegate that is bound by the closure. Because the return value needs the delegate, the delegate’s lifetime extends beyond the life of the method. The lifetime of the object representing the bound variables is extended. The object is reachable because the delegate instance is reachable, and the delegate is still reachable because it’s part of the returned object. And all members of the object are reachable because the object is reachable.
The C# compiler generates code that looks like this:
public static IEnumerable<int> MakeSequence()
{
    Closure c = new Closure();
    c.generatedCounter = 0;
    IEnumerable<int> sequence = Extensions.Generate(30,
        new Func<int>(c.generatorFunc));
    return sequence;
}
Notice that this sequence contains a delegate reference to a method bound to c, the local object instantiating the closure. The local variable c lives beyond the end of the method.
Often, this situation does not cause much concern. But there are two cases in which it can cause confusion. The first involves IDisposable. Consider the following code. It reads numbers from a CSV input stream and returns the values as a sequence of sequences of numbers. Each inner sequence contains the numbers on that line. It uses some of the extension methods shown in Item 28 (Chapter 4).
public static IEnumerable<string> ReadLines(
    this TextReader reader)
{
    string txt = reader.ReadLine();
    while (txt != null)
    {
        yield return txt;
        txt = reader.ReadLine();
    }
}

public static int DefaultParse(this string input,
    int defaultValue)
{
    int answer;
    return (int.TryParse(input, out answer)) ?
        answer : defaultValue;
}

public static IEnumerable<IEnumerable<int>>
    ReadNumbersFromStream(TextReader t)
{
    var allLines = from line in t.ReadLines()
                   select line.Split(',');
    var matrixOfValues = from line in allLines
                         select from item in line
                                select item.DefaultParse(0);
    return matrixOfValues;
}
You would use it like this:
TextReader t = new StreamReader("TestFile.txt");
var rowsOfNumbers = ReadNumbersFromStream(t);
Remember that queries generate the next value only when that value is requested. The ReadNumbersFromStream() method does not put all the data in memory; rather, it loads values from the stream as needed. The two preceding statements don't actually read the file. Only later, when you start enumerating the values in rowsOfNumbers, do you open the file and begin reading values.
Later, in a code review, someone (say, that pedantic Alexander) points out that you never explicitly close the test file. Perhaps he noticed a resource leak, or he hit an error because the file was still open when he tried to read it again. You make a change to fix that problem. Unfortunately, it doesn't address the root cause.
IEnumerable<IEnumerable<int>> rowOfNumbers;
using (TextReader t = new StreamReader("TestFile.txt"))
    rowOfNumbers = ReadNumbersFromStream(t);
You happily start your tests, expecting success, but your program throws an exception a couple of lines later:
IEnumerable<IEnumerable<int>> rowOfNumbers;
using (TextReader t = new StreamReader("TestFile.txt"))
    rowOfNumbers = ReadNumbersFromStream(t);

foreach (var line in rowOfNumbers)
{
    foreach (int num in line)
        Console.Write("{0}, ", num);
    Console.WriteLine();
}
What happened? You tried to read from the file after you closed it, so the iteration throws an ObjectDisposedException. The query captured the TextReader in the closure that reads and parses items from the file; that deferred work is what the variable rowOfNumbers represents. Nothing has really happened yet: The stream has not been read, and nothing has been parsed. That's one of the issues that arise when you move resource management back up to the callers. If those callers misunderstand the lifetimes of resources, they will introduce problems that range from resource leaks to broken code.
The specific fix is straightforward. You move the code around so that you use the array of numbers before you close the file:
using (TextReader t = new StreamReader("TestFile.txt"))
{
    var arrayOfNums = ReadNumbersFromStream(t);
    foreach (var line in arrayOfNums)
    {
        foreach (var num in line)
            Console.Write("{0}, ", num);
        Console.WriteLine();
    }
}
That’s great, but not all your problems are that simple. This strategy will lead to lots of duplicated code, and we’re always trying to avoid that. So let’s look at this solution for some hints about what can lead to a more general answer. The foregoing piece of code works because it uses the array of numbers before the file is closed.
You’ve structured the code in such a way that it’s almost impossible to find the right location to close the file. You’ve created an API wherein the file must be opened in one location but cannot be closed until a later point. Suppose the original usage pattern were more like this:
using (TextReader t = new StreamReader("TestFile.txt"))
    return ReadNumbersFromStream(t);
Now you’re stuck with no possible way to close the file. It’s opened in one routine, but somewhere up the call stack, the file needs to be closed. Where? You can’t be sure, but it’s not in your code. It’s somewhere up the call stack, outside your control, and you’re left with no idea even what the file name is and no stream handle to examine what to close.
One obvious solution is to create one method that opens the file, reads the sequence, and returns the sequence. Here’s a possible implementation:
public static IEnumerable<string> ParseFile(string path)
{
    using (StreamReader r = new StreamReader(path))
    {
        string line = r.ReadLine();
        while (line != null)
        {
            yield return line;
            line = r.ReadLine();
        }
    }
}
This method uses the same deferred execution model I show you in Item 17 (Chapter 3). What's important here is that the StreamReader object is disposed of when the enumeration ends, whether the caller walks the entire sequence or abandons it early. The file will be closed, but only after the sequence has been enumerated. Here's a smaller, contrived example to show what I mean.
class Generator : IDisposable
{
    private int count;
    public int GetNextNumber()
    {
        return count++;
    }

    #region IDisposable Members
    public void Dispose()
    {
        Console.WriteLine("Disposing now ");
    }
    #endregion
}
The Generator class implements IDisposable, but only to show you what happens when you capture a variable of a type that implements IDisposable. Here’s one sample usage:
var query = (from n in SomeFunction()
             select n).Take(5);
foreach (var s in query)
    Console.WriteLine(s);

Console.WriteLine("Again");

foreach (var s in query)
    Console.WriteLine(s);
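The query relies on a SomeFunction() iterator that isn't shown in this fragment. A minimal sketch, assuming it creates a Generator inside a using block and yields its values, would look like this; that placement of the using block is what produces the disposal behavior shown in the output below.
// Assumed sketch of SomeFunction(): the Generator is created and
// disposed of inside the iterator, so Dispose() runs when each
// enumeration of the sequence ends (or is abandoned early).
private static IEnumerable<int> SomeFunction()
{
    using (Generator g = new Generator())
    {
        while (true)
            yield return g.GetNextNumber();
    }
}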
Here’s the output from this code fragment:
0
1
2
3
4
Disposing now
Again
0
1
2
3
4
Disposing now
The Generator object is disposed of when you would hope: after you have completed the iteration for the first time. Generator is disposed of whether you complete the iteration sequence or you stop the iteration early, as this query does.
However, there is a problem here. Notice that “Disposing now” is printed twice. Because the code fragment iterated the sequence twice, the code fragment caused Generator to be disposed of twice. That’s not a problem in the Generator class, because that’s only a marker. But the file example throws an exception when you enumerate the sequence for the second time. The first enumeration finishes, and StreamReader gets disposed of. Then the second enumeration tries to access a stream reader that’s been disposed of. It won’t work.
If your application will likely perform multiple enumerations on a disposable resource, you need to find a different solution. You may find that your application reads multiple values, processing them in different ways during the course of an algorithm. It may be wiser to use delegates to pass the algorithm, or multiple algorithms, into the routine that reads and processes the records from the file.
You need a generic version of this method that lets callers describe how they want to use the values, and then performs that work before the file is finally disposed of. The same action would look like this:
// Usage pattern: parameters are the file
// and the action you want taken for each line in the file.
ProcessFile("testFile.txt",
    (arrayOfNums) =>
    {
        foreach (IEnumerable<int> line in arrayOfNums)
        {
            foreach (int num in line)
                Console.Write("{0}, ", num);
            Console.WriteLine();
        }
        // Make the compiler happy by returning something:
        return 0;
    }
);
// declare a delegate type
public delegate TResult ProcessElementsFromFile<TResult>(
    IEnumerable<IEnumerable<int>> values);

// Method that reads files, processing each line
// using the delegate
public static TResult ProcessFile<TResult>(string filePath,
    ProcessElementsFromFile<TResult> action)
{
    using (TextReader t = new StreamReader(filePath))
    {
        var allLines = from line in t.ReadLines()
                       select line.Split(',');
        var matrixOfValues = from line in allLines
                             select from item in line
                                    select item.DefaultParse(0);
        return action(matrixOfValues);
    }
}
This looks a bit complicated, but it is helpful if you find yourself using this data source in many ways. Suppose you need to find the global maximum in the file:
var maximum = ProcessFile("testFile.txt",
    (arrayOfNums) =>
        (from line in arrayOfNums
         select line.Max()).Max());
Here, the use of the file stream is completely encapsulated inside ProcessFile. The answer you seek is a value, and it gets returned from the lambda expression. By changing the code so that the expensive resource (here, the file stream) gets allocated and released inside the function, you don’t have expensive members being added to your closures.
The other problem with expensive resources captured in closures is less severe, but it can affect your application’s performance metrics. Consider this method:
IEnumerable<int> ExpensiveSequence()
{
    int counter = 0;
    IEnumerable<int> numbers = Extensions.Generate(30,
        () => counter++);
    Console.WriteLine("counter: {0}", counter);
    ResourceHog hog = new ResourceHog();
    numbers = numbers.Union(
        hog.SequenceGeneratedFromResourceHog(
            (val) => val < counter));
    return numbers;
}
Like the other closures I’ve shown, this algorithm produces code that will be executed later, using the deferred execution model. This means that ResourceHog lives beyond the end of this method to whenever client code enumerates the sequence. Furthermore, if ResourceHog is not disposable, it will live on until all roots to it are unreachable and the garbage collector frees it.
If this is a bottleneck, you can restructure the query so that the numbers generated from ResourceHog get evaluated eagerly and thus ResourceHog can be cleaned up immediately:
IEnumerable<int> ExpensiveSequence()
{
    int counter = 0;
    IEnumerable<int> numbers = Extensions.Generate(30,
        () => counter++);
    Console.WriteLine("counter: {0}", counter);
    ResourceHog hog = new ResourceHog();
    IEnumerable<int> mergeSequence =
        hog.SequenceGeneratedFromResourceHog(
            (val) => val < counter).ToList();
    numbers = numbers.Union(mergeSequence);
    return numbers;
}
This sample is clear because the code isn't very complicated. In methods that create closures around more-complicated algorithms, it can be much harder to untangle the inexpensive resources from the expensive ones captured in the closure's bound variables. The following method uses three different local variables captured in a closure.
private static IEnumerable<int> LeakingClosure(int mod)
{
    ResourceHogFilter filter = new ResourceHogFilter();
    CheapNumberGenerator source = new CheapNumberGenerator();
    CheapNumberGenerator results = new CheapNumberGenerator();

    double importantStatistic = (from num in source.GetNumbers(50)
                                 where filter.PassesFilter(num)
                                 select num).Average();

    return from num in results.GetNumbers(100)
           where num > importantStatistic
           select num;
}
At first examination, it appears fine. The ResourceHogFilter is used only to compute the important statistic. It's scoped to the method, so you'd expect it to become garbage as soon as the method exits.
Unfortunately, this method is not as fine as it appears to be.
Here’s why. The C# compiler creates one nested class per scope to implement a closure. The final query statement-which returns the numbers that are greater than the important statistic-needs a closure to contain the bound variable, the important statistic. Earlier in the method, the filter needs to be used in a closure to create the important statistic. This means that the filter gets copied into the nested class that implements the closure. The return statement returns a type that uses an instance of the nested class to implement the where clause. The instance of the nested class implementing the closure has leaked out of this method. Normally you wouldn’t care. But if ResourceHogFilter really uses expensive resources, this would be a drain on your application.
To fix this problem, you need to split the method into two parts and get the compiler to create two closure classes:
private static IEnumerable<int> NotLeakingClosure(int mod)
{
    var importantStatistic = GenerateImportantStatistic();
    CheapNumberGenerator results = new CheapNumberGenerator();
    return from num in results.GetNumbers(100)
           where num > importantStatistic
           select num;
}

private static double GenerateImportantStatistic()
{
    ResourceHogFilter filter = new ResourceHogFilter();
    CheapNumberGenerator source = new CheapNumberGenerator();
    return (from num in source.GetNumbers(50)
            where filter.PassesFilter(num)
            select num).Average();
}
“But wait,” you say. “That return statement in GenerateImportantStatistic contains the query that generates the statistic. The closure still leaks.” No, it doesn’t. The Average method requires the entire sequence (see Item 40, earlier in this chapter). The enumeration happens inside the scope of GenerateImportantStatistic, and the average value is returned. The closure containing the ResourceHogFilter object can be garbage-collected as soon as this method returns.
I chose to rework the method in this way because more issues arise when you write methods that contain multiple logical closures. You might expect the compiler to create one closure per lambda, but it creates only one closure class per scope, and that class handles all the lambdas in that scope. That matters when one of the expressions can be returned from your method: You may think the other expressions don't matter, but they do. Because the compiler creates one class for all the closures in a single scope, every variable used in any of those closures becomes a member of that class. Examine this short method:
public IEnumerable<int> MakeAnotherSequence()
{
    int counter = 0;
    IEnumerable<int> interim = Extensions.Generate(30,
        () => counter++);
    Random gen = new Random();
    IEnumerable<int> numbers = from n in interim
                               select gen.Next() - n;
    return numbers;
}
MakeAnotherSequence() contains two queries. The first one generates a sequence of integers from 0 through 29. The second modifies that sequence using a random number generator. The C# compiler generates one private class to implement the closure that contains both counter and gen. The code that calls MakeAnotherSequence() will access an instance of the generated class containing both local variables. The compiler does not create two nested classes, only one. The instances of that one nested class will be passed to callers.
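Conceptually, the generated code looks something like the following sketch (the names are invented; the compiler's real generated names are unpronounceable):
// Rough sketch of the single generated closure class for
// MakeAnotherSequence(): both captured locals live in one object.
private class AnotherClosure
{
    public int counter;
    public Random gen;

    public int generatorFunc()
    {
        return counter++;
    }

    public int selectorFunc(int n)
    {
        return gen.Next() - n;
    }
}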
There’s one final issue relating to when operations happen inside a closure. Here’s a sample.
private static void SomeMethod(ref int i)
{
    //...
}

private static void DoSomethingInBackground()
{
    int i = 0;
    Thread thread = new Thread(delegate()
        { SomeMethod(ref i); });
    thread.Start();
}
In this sample, you've captured the local variable i, and it is accessed from two threads. Furthermore, you've structured the code so that SomeMethod receives it by reference. I'd like to tell you what happens to the value of i when you run this sample, but the truth is that you can't know. Both threads can examine or modify i, and depending on which thread runs faster, either one could change the value at any time.
When you use query expressions in your algorithms, the compiler creates a single closure for all expressions in the entire method. An object of that type may be returned from your method, possibly as a member of the type implementing the enumeration. That object will live in the system until all users of it have been removed. That may create many issues. If any of the fields copied into the closure implements IDisposable, it can cause problems with correctness. If any of the fields is expensive to carry, it can cause performance problems. Either way, you need to understand that when objects created by a closure are returned from methods, the closure contains all the variables used to perform the calculations. You must ensure that you need those variables, or, if you can’t do that, ensure that the closure can clean them up for you.
Item 42: Distinguish Between IEnumerable and IQueryable Data Sources
IQueryable<T> and IEnumerable<T> have very similar API signatures, and IQueryable<T> derives from IEnumerable<T>. You might think that the two interfaces are interchangeable. In many cases they are, and that's by design. But a sequence is not just a sequence: The two behave differently, and their performance characteristics can be very, very different. The following two query statements are quite different:
var q =
    from c in dbContext.Customers
    where c.City == "London"
    select c;

var finalAnswer = from c in q
                  orderby c.Name
                  select c;
// Code to iterate the finalAnswer sequence elided

var q =
    (from c in dbContext.Customers
     where c.City == "London"
     select c).AsEnumerable();

var finalAnswer = from c in q
                  orderby c.Name
                  select c;
// Code to iterate the finalAnswer sequence elided
These queries return the same result, but they do their work in very different ways. The first query uses the normal LINQ to SQL path, built on IQueryable functionality; the combination of lazy evaluation and the IQueryable<T> support in LINQ to SQL lets the whole query be composed before anything executes. The second version forces the database objects into IEnumerable sequences and does more of its work locally.
When the results of a query are executed, the LINQ to SQL libraries compose the results from all the query statements. In the example, this means that one call is made to the database. It also means that one SQL query performs both the where clause and the orderby clause.
In the second case, returning the first query as an IEnumerable<T> sequence means that subsequent operations use the LINQ to Objects implementation and are executed using delegates. The first statement causes a call to the database to retrieve all customers in London. The second orders the set returned by the first call by name. That sort operation occurs locally.
You should care about the differences because many queries work quite a bit more efficiently if you use IQueryable functionality than if you use IEnumerable functionality. Furthermore, because of the differences in how IQueryable and IEnumerable process query expressions, you’ll find that sometimes queries that work in one environment do not work in the other.
The processing is different at every step of the way, because the types used are different. The Enumerable extension methods use delegates for the lambda expressions and the function parameters that appear in query expressions. The Queryable extension methods, on the other hand, use expression trees to represent those same elements. An expression tree is a data structure that holds all the logic that makes up the actions in the query. The Enumerable version must execute locally: The lambda expressions have been compiled into methods, and they must execute now, on the local machine. That means you need to pull all the data into the local application space from wherever it resides. You'll transfer much more data, and you'll throw away whatever isn't necessary.
In contrast, the Queryable version parses the expression tree. After examining the expression tree, this version translates that logic into a format appropriate for the provider and then executes that logic where it is closest to the data location. The result is much less data transfer and better overall system performance. However, there are some restrictions on the code that goes into query expressions when you use the IQueryable interface and rely on the Queryable<T> implementation of your sequence.
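You can see the difference directly in the declarations of the two Where() overloads (shown here in simplified form):
// System.Linq.Enumerable: the predicate is a compiled delegate.
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate);

// System.Linq.Queryable: the predicate is an expression tree that the
// provider can examine and translate.
public static IQueryable<TSource> Where<TSource>(
    this IQueryable<TSource> source,
    Expression<Func<TSource, bool>> predicate);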
As I show earlier in this chapter in Item 37, IQueryable providers don’t parse any arbitrary method. That would be an unbounded set of logic. Instead, they understand a set of operators, and possibly a defined set of methods, that are implemented in the .NET Framework. If your queries contain other method calls, you may need to force the query to use the Enumerable implementation.
private bool isValidProduct(Product p)
{
    return p.ProductName.LastIndexOf('C') == 0;
}

// This works:
var q1 =
    from p in dbContext.Products.AsEnumerable()
    where isValidProduct(p)
    select p;

// This throws an exception when you enumerate the collection.
var q2 =
    from p in dbContext.Products
    where isValidProduct(p)
    select p;
The first query works because LINQ to Objects uses delegates to implement queries as method calls. The AsEnumerable() call forces the query into the local client space, and the where clause executes using LINQ to Objects. The second query throws an exception because LINQ to SQL uses an IQueryable<T> implementation: LINQ to SQL contains an IQueryProvider that translates your queries into T-SQL, and it has no translation for your isValidProduct() method. The T-SQL it produces is sent to the database engine, which executes the SQL statements in that context (see Item 38 earlier in this chapter). That approach can give you an advantage, because far less data gets transferred across tiers and possibly across layers.
In a typical tradeoff of performance versus robustness, you can avoid the exception by explicitly bringing the query back to IEnumerable<T>. The downside is that the LINQ to SQL engine now returns the entire set of dbContext.Products from the database, and the remainder of the query executes locally. Because IQueryable<T> inherits from IEnumerable<T>, a method written against IEnumerable<T>, such as the first ValidProducts() overload shown below, can be called with either kind of source.
That sounds good, and it can be a simple approach. But it forces any code that uses your method to fall back to the IEnumerable<T> sequence. If your client developer is using a source that supports IQueryable<T>, you have forced her to pull all the source elements into this process’s address space, then process all those elements here, and finally return the results.
Even though normally you would be correct to write that method once, and write it to the lowest common class or interface, that’s not the case with IEnumerable<T> and IQueryable<T>. Even though they have almost the same external capabilities, the differences in their respective implementations mean that you should use the implementation that matches your data source. In practice, you’ll know whether the data source implements IQueryable<T> or only IEnumerable<T>. When your source implements IQueryable, you should make sure that your code uses that type.
However, you may occasionally find that a class must support queries on IEnumerable<T> and IQueryable<T> for the same T:
public static IEnumerable<Product>
    ValidProducts(this IEnumerable<Product> products)
{
    return from p in products
           where p.ProductName.LastIndexOf('C') == 0
           select p;
}

// OK, because string.LastIndexOf() is supported
// by LINQ to SQL provider
public static IQueryable<Product>
    ValidProducts(this IQueryable<Product> products)
{
    return from p in products
           where p.ProductName.LastIndexOf('C') == 0
           select p;
}
Of course, this code reeks of duplicated effort. You can avoid the duplication by using AsQueryable() to convert any IEnumerable<T> to an IQueryable<T>:
public static IEnumerable<Product>
    ValidProducts(this IEnumerable<Product> products)
{
    return from p in products.AsQueryable()
           where p.ProductName.LastIndexOf('C') == 0
           select p;
}
AsQueryable() looks at the runtime type of the sequence. If the sequence is an IQueryable, it returns the sequence as an IQueryable. In contrast, if the runtime type of the sequence is an IEnumerable, then AsQueryable() creates a wrapper that implements IQueryable using the LINQ to Objects implementation, and it returns that wrapper. You get the Enumerable implementation, but it’s wrapped in an IQueryable reference.
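For example, the single IEnumerable<Product> version now serves both kinds of callers. (LoadCachedProducts() is a hypothetical helper that returns an in-memory list; Product and dbContext are the types from the earlier samples.)
// In-memory source: AsQueryable() wraps the list in the LINQ to Objects
// based implementation of IQueryable, and the query executes locally.
List<Product> cached = LoadCachedProducts();
var localMatches = cached.ValidProducts();

// Database source: dbContext.Products already implements IQueryable<Product>,
// so AsQueryable() hands the query straight to the LINQ to SQL provider.
var remoteMatches = dbContext.Products.ValidProducts();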
Using AsQueryable() gives you the maximum benefit. Sequences that already implement IQueryable will use that implementation, and sequences that support only IEnumerable will still work. When client code hands you an IQueryable sequence, your code will properly use the Queryable<T> methods and will support expression trees and foreign execution. And if you are working with a sequence that supports only IEnumerable<T>, then the runtime implementation will use the IEnumerable implementation.
Notice that this version still uses a method call: string.LastIndexOf(). That is one of the methods that are parsed correctly by the LINQ to SQL libraries, and therefore you can use it in your LINQ to SQL queries. However, every provider has unique capabilities, so you should not consider that method available in every IQueryProvider implementation.
IQueryable<T> and IEnumerable<T> might seem to provide the same functionality. All the difference lies in how each implements the query pattern. Make sure to declare query results using the type that matches your data source. Query methods are statically bound, and declaring the proper type of query variables means that you get the correct behavior.
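As a sketch of what that static binding means, reusing dbContext and the Customer type from the samples earlier in this item:
// Declared as IQueryable<Customer>: the where clause binds to
// Queryable.Where, becomes an expression tree, and is translated
// into SQL by the provider.
IQueryable<Customer> remote = dbContext.Customers;
var q1 = from c in remote
         where c.City == "London"
         select c;

// Declared as IEnumerable<Customer>: the same where clause binds to
// Enumerable.Where, so every customer is fetched and filtered locally.
IEnumerable<Customer> local = dbContext.Customers;
var q2 = from c in local
         where c.City == "London"
         select c;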
Item 43: Use Single() and First() to Enforce Semantic Expectations on Queries
A quick perusal of the LINQ libraries might lead you to believe that they have been designed to work exclusively with sequences. But there are methods that escape out of a query and return a single element. Each of these methods behaves differently from the others, and those differences help you express your intention and expectations for the results of a query that returns a scalar result.
Single() returns exactly one element. If no elements exist, or if multiple elements exist, then Single() throws an exception. That’s a rather strong statement about your expectations. However, if your assumptions are proven false, you probably want to find out immediately. When you write a query that is supposed to return exactly one element, you should use Single(). This method expresses your assumptions most clearly: You expect exactly one element back from the query. Yes, it fails if your assumptions are wrong, but it fails quickly and in a way that doesn’t cause any data corruption. That immediate failure helps you make a quick diagnosis and correct the problem. Furthermore, your application data doesn’t get corrupted by executing later program logic using faulty data. The query fails immediately, because the assumptions are wrong.
var somePeople = new List<Person>
{
    new Person { FirstName = "Bill", LastName = "Gates" },
    new Person { FirstName = "Bill", LastName = "Wagner" },
    new Person { FirstName = "Bill", LastName = "Johnson" }
};

// Will throw an exception because more than one
// element is in the sequence
var answer = (from p in somePeople
              where p.FirstName == "Bill"
              select p).Single();
Furthermore, unlike many of the other queries I’ve shown you, this one throws an exception even before you examine the result. Single() immediately evaluates the query and returns the single element. The following query fails with the same exception (although a different message):
var answer = (from p in somePeople
where p.FirstName == "Larry"
select p).Single();
Again, your code assumes that exactly one result exists. When that assumption is wrong, Single() always throws an InvalidOperationException.
If your query can return zero or one element, you can use SingleOrDefault(). However, remember that SingleOrDefault() still throws an exception when more than one value is returned. You are still expecting no more than one value returned from your query expression.
var answer = (from p in somePeople
where p.FirstName == "Larry"
select p).SingleOrDefault();
This query returns null (the default value for a reference type) to indicate that there were no values that matched the query.
Of course, there are times when you expect more than one value but want a specific one. The best choices are First() or FirstOrDefault(). Both return the first element in the sequence; the difference is that FirstOrDefault() returns the default value when the sequence is empty, whereas First() throws. The following query finds the forward who scored the most goals, but it returns null if none of the forwards has scored any goals.
// Works. Returns null
var answer = (from p in Forwards
              where p.GoalsScored > 0
              orderby p.GoalsScored descending
              select p).FirstOrDefault();

// Throws an exception if there are no values in the sequence:
var answer2 = (from p in Forwards
               where p.GoalsScored > 0
               orderby p.GoalsScored descending
               select p).First();
Of course, sometimes you don’t want the first element. There are quite a few ways to solve this problem. You could reorder the elements so that you do get the correct first element. (You could put them in the other order and grab the last element, but that would take somewhat longer.)
If you know exactly where in the sequence to look, you can use Skip and First to retrieve the one sought element. Here, we find the third-best goal-scoring forward:
var answer = (from p in Forwards
              where p.GoalsScored > 0
              orderby p.GoalsScored descending
              select p).Skip(2).First();
I chose First() rather than Take() to emphasize that I wanted exactly one element, not a sequence containing one element. Note that because I use First() instead of FirstOrDefault(), this code assumes that at least three forwards have scored goals; if fewer have, it throws.
However, once you start looking for an element in a specific position, it’s likely that there is a better way to construct the query. Are there different properties you should be looking for? Should you look to see whether your sequence supports IList<T> and supports index operations? Should you rework the algorithm to find exactly the one item? You may find that other methods of finding results will give you much clearer code.
Many of your queries are designed to return one scalar value. Whenever you query for a single value, it’s best to write your query to return a scalar value rather than a sequence of one element. Using Single() means that you expect to always find exactly one item. SingleOrDefault() means zero or one item. First and Last mean that you are pulling one item out of a sequence. Using any other method of finding one item likely means that you haven’t written your query as well as you should have. It won’t be as clear for developers using your code or maintaining it later.
Item 44: Prefer Storing Expression<> to Func<>
In Item 42 (earlier in this chapter) I briefly discuss how query providers such as LINQ to SQL examine queries before execution and translate them into their native format. LINQ to Objects, in contrast, implements queries by compiling lambda expressions into methods and creating delegates that access those methods. It’s plain old code, but the access is implemented through delegates.
LINQ to SQL (and any other query provider) performs this magic by asking for query expressions in the form of a System.Linq.Expressions.Expression object. Expression is an abstract base class that represents an expression. One of the classes derived from Expression is System.Linq.Expressions.Expression<TDelegate>, where TDelegate is a delegate type. Expression<TDelegate> represents a lambda expression as a data structure. You can analyze it by using the Body, NodeType, and Parameters properties. Furthermore, you can compile it into a delegate by using the Expression<TDelegate>.Compile() method. That makes Expression<TDelegate> more general than Func<T>. Simply put, Func<T> is a delegate that can be invoked. Expression<TDelegate> can be examined, or it can be compiled and then invoked in the normal way.
When your design includes the storage of lambda expressions, you’ll have more options if you store them using Expression<T>. You don’t lose any features; you simply have to compile the expression before invoking it:
Expression<Func<int, bool>> compound = val =>
(val % 2 == 1) && (val > 300);
Func<int, bool> compiled = compound.Compile();
Console.WriteLine(compiled(501));
The Expression class provides methods that allow you to examine the logic of an expression. You can examine an expression tree and see the exact logic that makes up the expression. The C# team provides a reference implementation for examining an expression with the C# samples delivered with Visual Studio 2008. The Expression Tree Visualizer sample, which includes source code, provides code that examines each node type in an expression tree and displays the contents of that node. It recursively visits each subnode in the tree; this is how you would examine each node in a tree in an algorithm to visit and modify each node.
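Even without the visualizer, the properties on Expression<TDelegate> let you peek inside an expression. As a small sketch, using the compound expression declared above:
// Inspect the structure of the lambda instead of executing it.
Console.WriteLine(compound.NodeType);            // the node kind (a lambda)
Console.WriteLine(compound.Parameters[0].Name);  // "val"
Console.WriteLine(compound.Body);                // a textual dump of the body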
Working with expressions and expression trees instead of functions and delegates can be a better choice, because expressions have quite a bit more functionality: You can convert an Expression to a Func, and you can traverse expression trees, meaning that you can create modified versions of the expressions. You can use Expression to build new algorithms at runtime, something that is much harder to do with Func.
This habit helps you by letting you later combine expressions using code. In this way, you build an expression tree that contains multiple clauses. After building the code, you can call Compile() and create the delegate when you need it.
Here is one way to combine two expressions to form a larger expression:
Expression<Func<int, bool>> IsOdd = val => val % 2 == 1;
Expression<Func<int, bool>> IsLargeNumber = val => val > 300;

InvocationExpression callLeft = Expression.Invoke(
    IsOdd, Expression.Constant(5));
InvocationExpression callRight = Expression.Invoke(
    IsLargeNumber, Expression.Constant(5));
BinaryExpression Combined =
    Expression.MakeBinary(ExpressionType.And,
        callLeft, callRight);

// Convert to a typed expression:
Expression<Func<bool>> typeCombined =
    Expression.Lambda<Func<bool>>(Combined);
Func<bool> compiled = typeCombined.Compile();
bool answer = compiled();
This code creates two small expressions and combines them into a single expression. Then it compiles the larger expression and executes it. If you’re familiar with either CodeDom or Reflection.Emit, the Expression APIs can provide similar metaprogramming capabilities. You can visit expressions, create new expressions, compile them to delegates, and finally execute them.
Working with expression trees is far from simple. Because expressions are immutable, it’s a rather extensive undertaking to create a modified version of an expression. You need to traverse every node in the tree and either (1) copy it to the new tree or (2) replace the existing node with a different expression that produces the same kind of result. Several implementations of expression tree visitors have been written, as samples and as open source projects. I don’t add yet another version here. A Web search for “expression tree visitor” will find several implementations.
The System.Linq.Expressions namespace contains a rich grammar that you can use to build algorithms at runtime. You can construct your own expressions by building the complete expression from its components. The following code executes the same logic as the previous example, but here I build the lambda expression in code:
// The lambda expression has one parameter:
ParameterExpression parm = Expression.Parameter(
    typeof(int), "val");
// We'll use a few integer constants:
ConstantExpression threeHundred = Expression.Constant(300,
    typeof(int));
ConstantExpression one = Expression.Constant(1, typeof(int));
ConstantExpression two = Expression.Constant(2, typeof(int));

// Creates (val > 300)
BinaryExpression largeNumbers =
    Expression.MakeBinary(ExpressionType.GreaterThan,
        parm, threeHundred);

// Creates (val % 2)
BinaryExpression modulo = Expression.MakeBinary(
    ExpressionType.Modulo, parm, two);

// Builds ((val % 2) == 1), using modulo
BinaryExpression isOdd = Expression.MakeBinary(
    ExpressionType.Equal, modulo, one);

// Creates ((val % 2) == 1) && (val > 300),
// using isOdd and largeNumbers
BinaryExpression lambdaBody =
    Expression.MakeBinary(ExpressionType.AndAlso,
        isOdd, largeNumbers);

// Creates val => (val % 2 == 1) && (val > 300)
// from the lambda body and parameter.
LambdaExpression lambda = Expression.Lambda(lambdaBody, parm);

// Compile it:
Func<int, bool> compiled = lambda.Compile() as Func<int, bool>;

// Run it:
Console.WriteLine(compiled(501));
Yes, using Expression to build your own logic is certainly more complicated than creating the expression from the Func<> definitions shown earlier. This kind of metaprogramming is an advanced topic. It’s not the first tool you should reach for in your toolbox.
Even if you don’t build and modify expressions, libraries you use might do so. You should consider using Expression<> instead of Func<> when your lambda expressions are passed to unknown libraries whose implementations might use the expression tree logic to translate your algorithms into a different format. Any IQueryProvider, such as LINQ to SQL, would perform that translation.
Also, you might create your own additions to your type that would be better served by expressions than by delegates. The justification is the same: You can always convert expressions into delegates, but you can’t go the other way.
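The asymmetry is easy to see in a small sketch:
// An expression can always become a delegate:
Expression<Func<int, int>> doubleIt = x => x * 2;
Func<int, int> asDelegate = doubleIt.Compile();

// But an existing delegate carries no expression tree to recover:
Func<int, int> tripleIt = x => x * 3;
// Expression<Func<int, int>> asExpression = tripleIt;  // does not compile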
You may find that delegates are an easier way to represent lambda expressions, and conceptually they are. Delegates can be executed. Most C# developers understand them, and often they provide all the functionality you need. However, if your type will store expressions and passing those expressions to other objects is not under your control, or if you will compose expressions into more-complex constructs, then you should consider using expressions instead of funcs. You’ll have a richer set of APIs that will enable you to modify those expressions at runtime and invoke them after you have examined them for your own internal purposes.