LINQ Framework Design Guidelines

Mircea Trofin (Program Manager at Microsoft, Redmond)  has written a nice article on LINQ Framework Design Guidelines.

Writing applications that interact with data sources, such as databases, XML documents, or Web Services such as Flickr or Amazon, is made easier in the .NET Framework 3.5 with the addition of a set of features collectively referred to as LINQ (Language-Integrated Query). In what follows, we start with a very brief overview of LINQ, followed by guidelines for designing APIs in relation to LINQ.

1.     A Brief Overview of LINQ

Quite often, programming requires processing over sets of values. Some probably well known examples include: extracting the list of the most recently added books from a database of products; or finding the email address of a person in a directory service such as Active Directory; or transforming parts of an XML document to HTML to allow for web publishing; or something as frequent as looking up a value in a hash table.

LINQ allows for a uniform, language-integrated programming model with data, independent of the technology used to store that data.

In terms of concrete libraries, LINQ is embodied as:

·         A specification in C# and VB of the notion of “extension methods”

·         The Query Pattern, a specification of the set of methods a type must provide in order to be considered as a LINQ provider. The names of these methods can be found in System.Linq.Enumerable (found in System.Core.dll). Their signature follows a pattern that will be discussed later in this chapter.

·         A definition of a new interface, IQueryable<T>:IEnumerable<T>

·         New types representing generic delegates: Func<> and Action<>

·         Lambda expressions, a language feature in VB and C# for defining anonymous delegates

·         A type representing the notion of a delay-compiled delegate, the Expression<> family of types.

·         An extension to the C# and VB syntax allowing for queries to be expressed in an alternative, SQL-like format. For example (C#):

//using extension methods:

IEnumerable<string> names = set.Where(x=>x.Age>20).Select(x=>x.Name);

//using SQL-like syntax:

IEnumerable<string> names = from x in set where x.Age>20

                             select x.Name;

The interplay between these features is the following: any IEnumerable<> can be queried upon using the LINQ extension methods, most of which require one or more lambda expressions as parameters; this leads to an in-memory, generic evaluation of the queries. For cases where the set of data is not in memory (e.g. in a database), and/or queries may be optimized, the set of data is presented as an IQueryable<>. If lambda expressions are given as parameters, they are transformed by the compiler to Expression<> objects. The implementation of IQueryable<> is responsible for processing said expressions. For example, the implementation of an IQueryable<> representing a database table would translate Expression<> objects to SQL queries.

2.     Extension Methods, Func<>, Action<>, and Expression<>

Extension methods constitute a language feature that allows static methods to be invoked on instance variables. These methods must take at least one parameter, which represents the instance the method is to operate on. For example, in C#, this is done by using the this modifier on such a parameter, when defining the method:

public static bool IsPalindrome(this string s){

     //implementation follows here

}

This example allows one to write, for example:

“some string”.IsPalindrome();

…which in this case would result in false being displayed on the console.

The class that defines such extension methods is referred to in this text as “sponsor” class, and it must be declared as static. To use extension methods, one must import the namespace defining their sponsor class.

Func<> objects represent a generic delegate. For example:

Func<int,int,double> divide=(x,y)=>(double)x/(double)y;

Console.WriteLine(divide(2,3));

In this example, divide is a function that takes two integers and returns a double. The last parameter in the generic definition of a Func<> is always the return type. To represent void-returning functions, use Action<> instead. For example:

Action<double> write=(aDouble)=>Console.WriteLine(aDouble);

write(divide(2,3));

Expression<> objects represent function definitions that can be compiled and subsequently invoked at runtime. Continuing with our example:

Expression<Func<int,int,double>> divideBody=(x,y)=>(double)x/(double)y;

Func<int,int,double> divide2=divideBody.Compile();

write(divide2(2,3));

Notice how the syntax for constructing an Expression<> object is very similar to the one used to construct a Func<> object; in fact, the only difference is the static type declaration of the variable (Expression<> instead of Func<>).

Expressions may also be constructed programmatically, using the predefined static methods of the non-generic Expression class, however, this is a topic that does not need to be covered for our purposes here.

ý Avoid frivolous use of the extension methods feature when defining methods on a new type. Use the canonical, language-specific means for defining type members.

There are a few reasons for this. The main one is that liberal use of extension methods has the potential of cluttering the API of types. Another reason is that extension methods are a compile-time facility, and not all languages provide support for them.

There are, of course, scenarios in which extension methods should be employed. These are outlined in what follows.

þ Consider using extension methods in any of the following scenarios:

·     to provide helper functionality relevant to every implementation of an interface, if said functionality can be written in terms of the core interface. This is because concrete implementations cannot otherwise be assigned to interfaces. For example, the LINQ to Objects operators are implemented as extension methods for all IEnumerable<T> types. Thus, any IEnumerable<> implementation is automatically LINQ-enabled

·     when object model considerations would dictate taking a dependency on some assembly, but taking such a dependency would break dependency management rules.

For example, consider the domain of telecommunications. Let us consider that a core domain-specific model is that of network nodes. A library is defined for this domain, called nodes.dll. The library contains types such as Node. Node may have properties such as “address” and “manufacturer”, etc. Now consider the problem of establishing communication paths through such a network (i.e. routing). A second library, routing.dll, is thus developed, taking a dependency on nodes.dll. Good engineering practice dictates that nodes.dll not take a dependency on routing.dll.  However, it could be imagined that, in the domain of path routing, the model for network nodes may have new characteristics, namely, that of being or not a path endpoint, and that of being associated with a number of paths. To maintain clean dependencies, a possibility is to have static methods defined in routing.dll, e.g. Paths.IsNodePathEndpoint(Node n) or Paths.GetListOfPathsCrossingNode(Node n). However, it may be considered more natural to simply have methods like IsEndpoint() defined on Node. Defining extension methods on the Node type in the paths library is, thus, the option to be considered.

ý Avoid defining extension methods on System.Object, unless absolutely necessary. When doing so, be aware that VB users will not be able to use thus-defined extension methods and, as such, they will not be able to take advantage of usability/syntax benefits that come with extension methods.

This is because, in VB, declaring a variable as object forces all method invocations on it to be late bound – while bindings to extension methods are compile-time determined (early bound). For example:

public static class SomeExtensions{

              static void Foo(this object o){…}

}

Object o = …

o.Foo();

In this example, the call to Foo will fail in VB. Instead, the VB syntax should simply be:

SomeExtensions.Foo(o)

 

Note that the guideline applies to other languages where the same binding behavior is present, or where extension methods are not supported.

ý Do not put extension methods in the same namespace as the extended type, unless it is for adding methods to interfaces, or for dependency management. In the latter case, the type would be in a different assembly.

ý Avoid redefining extension methods on a type T with extension methods on the same type.

For example, in C#, if two different namespaces defined the same extension method on the same type, it would be impossible to import both namespaces in the same file – the compiler would report an ambiguity.

<file 1>

namespace A{

              public static class ExtMethodsInA{

                     public static void ExtMethod(this T obj){…}

              }

}

<file 2>

namespace B{

              public static class ExtMethodsInB{

                     static void ExtMethod(this T obj){…}

              }

}

<file 3>

using A;

using B;

This will fail at compile time, since ExtMethod is considered to be defined twice. This may break valid reusability scenarios; if the scenarios occur late in the development cycle, refactoring namespaces may prove costly. It is thus preferable to avoid this form of overriding methods.

  

Mircea Trofin

Such a situation may occur when using a mix of third-party libraries. To resolve it, import in a file only one namespace defining extension methods on a type, and use fully-qualified static method calls for extension methods defined on the same type in a different namespace.

Using the example before:

<file 3>

using A;

T someObj=…

someObj.ExtMethod(); //this calls A.ExtMethodsInA.ExtMethod

B.      B.ExtMethodsInB.ExtMethod(someObj); //to avoid compilation errors of file 3, we call explicitly the extension method defined in namespace B

 

þ Do define extension methods in the same namespace as the extended type, if the type is an interface, and if the extension methods are meant to be used in most or all cases.

ý Do not define extension methods pertaining to a feature in namespaces normally associated with other features. Instead, define them in the namespace associated with the feature they belong to, or a namespace of it.

For example, do not define extension methods for Telecom.Node in the Telecom namespace, but in a separate namespace, e.g. Routing.

ý Avoid generic naming of namespaces dedicated to extension methods (e.g. “Extensions”) – rather, use a descriptive name – e.g. “Routing”.

þ Do use the new LINQ types “Func<>” and “Expression<>” instead of custom delegates and predicates, when defining new APIs. The example below illustrates this point:

//rather than:

delegate bool Tester(int i);

class AClass{

              public Tester MyTester{get;set;}

}

//use this:

class AClass{

              public Func<int,bool> MyTester{get;set;}

}

Benefits of using Func<>, in this case, include minimizing the concept count by employing an existing abstraction, as well as eliminating the dependency on the definition of Tester.

3.     Extending on LINQ

There are three means by which a type can be designed to participate in LINQ queries: implementing IEnumerable<> (or interfaces derived from it), implementing IQueryable<>, or by defining the Query Pattern on the type, with no relationship to these two interfaces.

To preface the guidelines, the choice can be made as follows:

–          Extend IEnumerable<> (or interfaced derived from it) if the default, LINQ to Objects support provided in the framework is sufficient

o   Override some/all LINQ methods if optimizations are required, but access to the query expression is not necessary

–          Extend IQueryable<> if access to the query expression is necessary

–          Define the Query Pattern on the new type if the domain makes it impossible or undesirable to implement IEnumerable<> or IQueryable<>

The query methods are those defined by the sponsor class System.Linq.Enumerable, as extension methods to IEnumerable<>. Regardless of the means by which a type participates in LINQ queries, the following guideline applies:

þ Do respect the signature pattern in Figure 1 when overriding LINQ methods.

We use S with or without subscript to indicate a collection type (e.g. IEnumerable<>, ICollection<>) and T, with or without subscript, to indicate the type of elements in that collection. Additionally, we use O<T> to represent subtypes of S<T> that are ordered. For example, S<T> is a notation that could be substituted with IEnumerable<int>. The first parameter of all the methods is the type of the object the method is applied on, and it is marked accordingly by prefixing with “this” – whether one chooses to implement these methods as extension methods or as member methods. Also, anywhere Func<> is being used, one may substitute accordingly with Expression<Func<>> – the guidelines indicate where that is preferable.

Please note that the notation is not meant to be valid code in any particular language, but simply present the type signature pattern.

S<T> Where(this S<T>, Func<T,bool>)

S<T2> Select(this S<T1>,Func<T1,T2>)

 

O<T> OrderBy(this S<T>, Func<T,K>), where K is  IComparable

 

O<T> ThenBy(this O<T>,Func<T,K>), where K is  IComparable

 

S<T> Union(this S<T>, S<T>)

 

S<T> Skip(this S<T>,int)

S<T> Take(this S<T>,int)

S<T> SkipWhile(this S<T>, Func<T,bool>)

 

T ElementAt(this S<T>,int)

 

S<T3> Join(this S<T1>, S<T2>,

 Func<T1,K1>,Func<T2,K2>,

 Func<T1,T2,T3>)

 

S<T3> SelectMany(this   S<T1>,Func<T1,S<T2>>,Func<T1,T2,T3>)

 

S<T2> SelectMany(this S<T1>,Func<T1,S<T2>>)

Figure 1 Query Pattern Method Signatures

Extending IEnumerable<>

þ Do implement IEnumerable<T> to enable basic LINQ support. Such basic support should be sufficient for most in-memory data sets. The basic LINQ support will use the default -provided extension methods on IEnumerable<T>.

For example, simply defining:

public class RangeOfInts:IEnumerable<int>{

              public IEnumerator<int> GetEnumerator() {…}

              public IEnumerator GetEnumerator() {…}

}

Allows for the following code:

p var a=new RangeOfInts();

var b=a.Where(x=>x>10);

 

Although RangeOfInts did not implement a “Where” method.

þ Consider redefining the methods in the LINQ pattern on new types implementing IEnumerable<T>, if it is desirable to override the default implementation – e.g. for optimization. It is preferred that one redefines them as type members, rather than extension methods.

For example, rather than:

public class MyDataSet<T>:IEnumerable<T>{…}

public static class MyDataSetExtensions{

  public static MyDataSet<T> Where(this MyDataSet<T> o,Func<T,bool> f){…}

};

Opt for this:

public class MyDataSet<T>:IEnumerable<T>{

  public MyDataSet<T> Where(Func<T,bool> f){…}

}

þ Consider implementing ICollection<T> to improve performance of query operators.

For example, the Count<> method default behavior (as defined in System.Linq) is to simply walk the IEnumerable. Collection types can optimize their implementation of this method, since they typically offer an O(1) – complexity mechanism for finding the size of the collection.

Extending IQueryable<>

þ Consider implementing IQueryable<T> when access to the query expression is necessary. For example, this applies for types representing potentially large data sets generated by remote processes, or for certain optimizations. An example of such a dataset is a database, the filesystem, or an in-memory dictionary that needs to be optimized for searching  for values (rather than keys).

The following guidelines detail this further:

ý Do not implement IQueryable<T> without understanding all the performance tradeoffs.

·     Building and parsing expression trees is expensive and many queries can actually get slower when IQueryable<T> is implemented.

·     The trade-off is acceptable in the LINQ to SQL case, since the alternative overhead of performing queries in memory would have been far greater than the transformation of the expression to an SQL statement and the delegation of the query processing to the database server.

þ Do throw NotSupportedException from IQueryable<T> methods that cannot be logically supported by your data source.

For example, imagine representing a media stream (like internet radio) as an IQueryable<>. The Count method is not logically supported – the stream can be considered as infinite.

Implementing the Query Pattern

The Query Pattern refers to defining the methods in Figure 1 without implemeting the IEnumerable<> or IQueryable<> interfaces.

þ Do implement at least the enumerable pattern (i.e. provide a GetEnumerator method) for types representing collections of data. This is because the expectation is that the result of the query methods be foreach-able.

þ Do implement the Query Pattern as instance members on the new type, if the members make sense on the type even outside of the context of LINQ. Otherwise, use extension methods.

þ Do represent ordered sequences as a separate type. Define on this type the “ThenBy” method.

This follows the current pattern in the LINQ-to objects implementation, as well as allows for early (compile-time) detection of errors such as applying “ThenBy” to a not “OrderBy”-ed sequence.

For  example, the framework provides the IOrderedEnumerable<> type, which is returned by “OrderBy”. The “ThenBy” extension method is defined for this type, and not for IEnumerable<>.

þ Consider designing the LINQ operators to return specific enumerable types for domains that should be restricted. Essentially, one is free to return anything from a Select query method, however, the expectation is that the query result type should be at least enumerable in a foreach loop – see the example below, as well as the next guideline:

var set1=…

var set2=set1.Select(x=>x.SomeIntProperty);

foreach(int i in set2){…}

þ Do defer execution of query operator implementations. The expected behavior of most of the Query Pattern members is that they simply construct a new object which, upon enumeration, produces the elements of the set which match the query. The evaluation time is, then, at enumeration.

The following methods are exception from this rule: All, Any, Average, Contains, Count, ElementAt, Empty, First, FirstOrDefault, Last, LastOrDefault, Max, Min, Single, Sum.

In the example above, the expectation is that the time necessary for evaluating the second line be be independent from the size or nature (e.g. in-memory or remote server) of set1. The general expectation is that this line simply prepares set2, delaying the determination of its composition to the time of its enumeration.

 ý Avoid implementing just a part of the Query Pattern, if fallback to the basic IEnumerable<T> implementations is undesirable.

For example, consider a user-defined type T, which implements IEnumerable<>. T has an override for Count, but not for Where. Consider then the following example:

var query=someT.Where(f2).Count();

 

In this example, any opportunities for optimization are lost after the Where call. The Count version used is the one defined for IEnumerable<>.

þ Do place query extensions methods in a “Linq” sub-namespace of the main namespace. For example, extension methods for System.Data features reside in System.Data.Linq namespace.

þ Do use Expression<Func<>> as a parameter instead of Func<> when it is necessary to inspect the query.

As it was discussed, interacting with an SQL database is already done through IQueryable<T> rather than IEnumerable<T>, since this gives an opportunity to translate lambda expressions to SQL expressions.

An alternative reason for implementing IQueryable<T> is performing optimizations. For example, a sorted list can implement lookup (“Where” clauses) with binary search, which can be much more efficient than the standard IEnumerable<T> or IQueryable<T> implementations.

Source: Mircea Trofin’s blog