Tuesday, May 24, 2011

SharePoint Search 2007 – hacking the SiteData web service – Part I

When I started preparing this posting I realized that it would be too long, so I decided to split it into two parts – the first one being more introductory and explaining some aspects of the inner workings of the SharePoint search engine, and the second one concentrating on the actual implementation of the “hack”. Then, when I started the first part, which you are now reading, I felt that the posting’s title itself already raises several questions, so it would be a good idea to start with a brief Q & A that will help you get into the matter at hand. Here is a short list of questions that you may have asked yourself just two sentences into the posting:

  1. What is the relation between the SharePoint search service and the SharePoint SiteData web service in the first place?
  2. Why would I need to change the way SharePoint search works – what are the reasons and motives for that?
  3. Is it a wise idea and would you recommend using this hack?

And the answers come promptly:

  1. To answer this one we need to have a closer look at the internal workings of the SharePoint search index engine. If you are not familiar with some core concepts and basic terminology like index engine, content sources, filter daemon, protocol handlers and IFilters, I would recommend that you first check these two MSDN articles – here (for a high level architecture overview) and here (for a high level overview of the protocol handlers). Let me start with a few words about the protocol handlers – they are responsible for crawling the different types of content sources. They are implemented as COM components written in unmanaged code (C or C++). If you are familiar with COM Interop you will know that it is also possible to create COM components using .NET and managed code, and in fact there is a sample .NET protocol handler project on CodePlex. I am not sure, though, how wise it is to create your own protocol handler in managed code (apart from the fact that it is quite complex to begin with), knowing that all existing protocol handlers from Microsoft and third party vendors are written in unmanaged code.
    You can check the available index engine protocols and their matching protocol handlers for your SharePoint installation in the Windows registry (a small sketch that enumerates them programmatically follows after this list):

    [Image: the Windows registry key listing the search protocols and their matching protocol handler ProgIDs]

    You can see that there are different protocol handlers for the different types of content sources – SharePoint sites, external web sites, file shares, BDC, etc. The name of the protocol handler (the “Data” column in the image above) is actually the ProgID (in COM terms) of the COM component that implements the handler.
    In this posting we are interested in just one of the protocol handlers – the one for the Sts3 protocol, which is responsible for crawling the content from SharePoint sites. The same handler is also used for the Sts3s protocol (see the image), which again covers SharePoint sites, but ones that use the HTTPS (SSL) scheme. And now the interesting part – how does the Sts3 protocol handler traverse the content from SharePoint? The answer is also the answer to the first question in the list above – it calls the standard SharePoint SiteData web service (/_vti_bin/SiteData.asmx). If you wonder why it doesn’t, for instance, use the SharePoint object model directly – the main reason, I think, is greater scalability (not to mention that it would be at best challenging to call managed code from unmanaged code). The better scalability comes from the fact that the handler can be configured to call the SiteData web service on all available web front-end servers in the SharePoint farm, which distributes the workload better and makes better use of the farm’s resources. Later in the posting I will give you more details about how you can check and monitor the calls to the SiteData web service from the crawl engine, and also some additional information about the exact methods of the SiteData service that are used for traversing the content of the SharePoint sites.
  2. As I already mentioned in the answer to the first question, this posting deals specifically with the search functionality that targets SharePoint content, so the motives for this hack are directly related to using and querying SharePoint data. The reasons can be separated into two groups – the first one is more general: why use SharePoint search and not some other available method? The second one is more specific: what is missing or not well implemented in the SharePoint search query engine and needs to be changed or improved?
    Let me start with the first group – among the available methods to query and aggregate SharePoint content in the form of SharePoint list item and document metadata, SharePoint search doesn’t even come as the first or preferred option. Normally you would use the SharePoint object model with the SPList, SPListItem and SPQuery classes (for a single SharePoint list) or the SPSiteDataQuery class with the SPWeb.GetSiteData method (or alternatively the CrossListQueryInfo and CrossListQueryCache classes if you use the publishing infrastructure) for querying and retrieving data from many lists in one site collection – a minimal SPSiteDataQuery sketch follows after this list. The cross list query functionality is used directly by the standard SharePoint Content Query Web Part (CQWP), so even without using custom code you may have experienced certain issues with it. Probably the biggest one is performance – you may never have seen it, or you may be well aware of it, because it becomes a real issue only if your site collection becomes very big in terms of the number of sub-sites and you use queries that aggregate data from most of the available sub-sites. Add to these two conditions the number of list items in the individual SharePoint lists, which further degrades the performance. So when does this become a visible issue? You can have various combinations of the conditions above, but if you query more than a hundred sub-sites and/or you have more than several thousand items in every list (or many of the lists), you may see page loading times ranging from several seconds to well above a minute in certain extreme cases. And … this is an issue even with the built-in caching capabilities of the cross list query classes. As to why the caching doesn’t always solve the performance issue – there are several reasons (and cases) for that: first, there are specific CAML queries for which the caching is not used at all (e.g. queries that contain the <UserID /> element); second, even if the caching works well, the first load that populates the cache will still be slow; etc.
    Let me now briefly explain why the cross list query has such performance issues (only in the cases mentioned above). The main reason is the fact that the content database stores all list data (all list items in the whole site collection – and a content database may contain more than one site collection) in a highly denormalized table called AllUserData. This design decision was entirely deliberate because it allows all the flexibility that we know with SharePoint lists in terms of the ability to add, modify and customize fields, which unfortunately comes with a price in some rare cases like this one. Let’s see how the cross list query works from a database perspective with a real example – say that we have a site collection with one hundred sub-sites, each containing an “announcements” list with two custom fields, “expiration date” and “publication date”. On the home page of the root site we want to place a CQWP that displays the latest five announcements (aggregated from all sub-sites) ordered by publication date and for which the expiration date is in the future. Knowing that all list item data is contained in a single database table, you may think that it is possible to retrieve the aggregated data in a single SQL query but, alas, this is not the case. If you have a closer look at the AllUserData table you will find that it contains columns whose names go: nvarchar1, nvarchar2, …, int1, int2, …, datetime1, datetime2, … – these are the underlying storage placeholders for the various types of SharePoint fields in your lists. Obviously the “publication date” and “expiration date” will be stored in two of the “datetimeN” SQL columns, but the important thing is that for different lists the mappings may be totally different, e.g. for list 1 “publication date” and “expiration date” may map to datetime1 and datetime2 respectively, whereas for list 2 they can map to datetime3 and datetime4 respectively. This heterogeneous storage pattern makes the retrieval much more complex and time-consuming – the object model first needs to extract the metadata for all target lists in these one hundred sites (which contains the field mappings) and after that retrieve the items from all one hundred lists one by one, making a SQL union with the correct field-to-column mappings and applying the filtering and sorting after that. If you are interested in checking that yourself, you can use the SQL Server Profiler tool that comes with SQL Server Management Studio.
    Having seen the performance issues that may arise with the usage of the built-in cross list query functionality of SharePoint, it is quite natural to check what SharePoint search can offer as an alternative (a small search query sketch also follows after this list). Obviously it performs much faster in these cases and allows data retrieval and metadata filtering, but its results and functionality are not exactly identical to those of the cross list query. And here we come to the second group of motives for implementing this kind of hack that I mentioned in the beginning of this paragraph. So let’s see some of the things that we’re missing in SharePoint search. From a data retrieval perspective: text fields, especially the ones that contain HTML, are returned by the search query with the mark-up stripped out (this is especially embarrassing for the publishing Image field type, whose values are stored as mark-up and are retrieved virtually empty by the search query); the “content type id” field is never crawled and cannot be used as a crawled and managed property; “lookup” fields (and derived field types such as the “user” type) are retrieved as plain text, with the lookup item ID contained in the field value stripped out; etc. From a filtering and sorting perspective you have pretty much everything needed – you can perform comparison operations on the basic value types (text, date, integer and float) and get correct sorting based on the respective field type. What is missing is, for instance, filtering on “lookup” (including “user”) fields based not on the textual value but on the integer (lookup ID) value – this part of the lookup field value is simply ignored by the search crawler (we’ll come to that in the next part of the posting). For the same reason you cannot filter on the “content type id” field.
    The next question, of course, is whether it is possible to achieve these things with SharePoint search – the answer is yes, and the hack that is the subject of this posting does exactly that.
  3. And lastly the third and most serious one – most of the time I am overly critical towards my own code and solutions, so I would normally not recommend using this hack (I will publish the source code in the second part of the posting), at least not in production environments. I would suggest that you use it only in a very limited way in development/testing or small intranet environments, if at all. Still, I suppose that the material in the posting about some of the inner workings of the indexing engine and the SiteData web service will be interesting and useful by itself.
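
Before moving on, here is the minimal sketch I promised above for listing the registered protocol handlers from code instead of browsing the registry by hand. The registry path below is the one typically used by the Office SharePoint Server 2007 search service and is an assumption on my part – verify it on your own installation (the WSS-only search, for example, keeps its handlers under a different branch):

using System;
using Microsoft.Win32;

class ProtocolHandlerLister
{
    static void Main()
    {
        // Assumed location of the MOSS 2007 search protocol handler registrations –
        // check the exact key on your own server before relying on it.
        const string keyPath = @"SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ProtocolHandlers";

        using (RegistryKey key = Registry.LocalMachine.OpenSubKey(keyPath))
        {
            if (key == null)
            {
                Console.WriteLine("Protocol handler key not found – check the path for your installation.");
                return;
            }

            // Each value name is a protocol (Sts3, Sts3s, File, ...) and each value's data
            // is the ProgID of the COM component that implements the handler.
            foreach (string protocol in key.GetValueNames())
            {
                Console.WriteLine("{0} -> {1}", protocol, key.GetValue(protocol));
            }
        }
    }
}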
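
To make the cross list query discussion above more concrete, here is a minimal SPSiteDataQuery sketch for the “latest five announcements” scenario from the second answer. The site URL and the internal field names PublicationDate and ExpirationDate are assumptions – substitute the internal names your custom fields actually use:

using System;
using System.Data;
using Microsoft.SharePoint;

class CrossListQuerySample
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://myserver"))   // sample URL
        using (SPWeb rootWeb = site.OpenWeb())
        {
            SPSiteDataQuery query = new SPSiteDataQuery();

            // Target every announcements list (server template 104) in the whole site collection.
            query.Lists = "<Lists ServerTemplate=\"104\" />";
            query.Webs = "<Webs Scope=\"SiteCollection\" />";

            // PublicationDate and ExpirationDate are hypothetical internal names of the two custom fields.
            query.ViewFields = "<FieldRef Name=\"Title\" />" +
                               "<FieldRef Name=\"PublicationDate\" Nullable=\"TRUE\" />" +
                               "<FieldRef Name=\"ExpirationDate\" Nullable=\"TRUE\" />";

            // Only items whose expiration date is in the future, newest publication date first.
            query.Query = "<Where><Gt><FieldRef Name=\"ExpirationDate\" />" +
                          "<Value Type=\"DateTime\"><Today /></Value></Gt></Where>" +
                          "<OrderBy><FieldRef Name=\"PublicationDate\" Ascending=\"FALSE\" /></OrderBy>";

            query.RowLimit = 5;

            DataTable results = rootWeb.GetSiteData(query);
            foreach (DataRow row in results.Rows)
            {
                Console.WriteLine(row["Title"]);
            }
        }
    }
}

On a small site collection this executes almost instantly; with a hundred sub-sites and large lists, this single call is exactly the kind of query that can take seconds or more.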
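
And for comparison, here is a rough sketch of the same aggregation going through the MOSS 2007 search query object model with a FullTextSqlQuery. The managed properties PublicationDate and ExpirationDate are again assumptions – they would exist only if you had created them and mapped them to the corresponding crawled “ows_” properties – and the date literal is only one of the forms the search SQL syntax accepts, so treat the query text as illustrative:

using System;
using System.Data;
using Microsoft.Office.Server;
using Microsoft.Office.Server.Search.Query;

class SearchQuerySample
{
    static void Main()
    {
        // ServerContext.Default points to the default SSP; adjust for your farm.
        FullTextSqlQuery query = new FullTextSqlQuery(ServerContext.Default);

        // PublicationDate and ExpirationDate are hypothetical managed properties.
        query.QueryText =
            "SELECT Title, Path, PublicationDate, ExpirationDate " +
            "FROM SCOPE() " +
            "WHERE \"scope\" = 'All Sites' AND ExpirationDate > '" +
            DateTime.UtcNow.ToString("yyyy-MM-dd") + "' " +
            "ORDER BY PublicationDate DESC";

        query.ResultTypes = ResultType.RelevantResults;
        query.RowLimit = 5;

        ResultTableCollection results = query.Execute();
        ResultTable relevantResults = results[ResultType.RelevantResults];

        // ResultTable implements IDataReader, so it can be loaded into a DataTable.
        DataTable table = new DataTable();
        table.Load(relevantResults, LoadOption.OverwriteChanges);
        foreach (DataRow row in table.Rows)
        {
            Console.WriteLine(row["Title"]);
        }
    }
}

This runs in a fraction of the time of the cross list query on big site collections, but it comes with the data retrieval and filtering limitations listed in the second answer – which is what the hack in the second part of this posting tries to address.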

So, let’s now see how the index engine, or more precisely the Sts3 protocol handler, calls the SiteData web service. Basically you can track the SiteData.asmx invocations by simply checking the IIS logs of your web front-end server or servers (you have to have IIS logging enabled beforehand). If you first run a full crawl on one of your “SharePoint Site” content sources from the SSP admin site and, after it completes, open the latest IIS log file, you will see many requests to _vti_bin/SiteData.asmx and also to all pages and documents available in the SharePoint sites listed in the selected content source. It is logical to conclude that the protocol handler calls the SiteData web service to traverse the existing SharePoint hierarchy and fetch the available metadata for the SharePoint list items and documents, and that it then also opens every page and document and scans/indexes their contents so that they are available later for the full text search queries.

Checking the IIS logs was in fact the first thing that I tried when I began investigating the SiteData–SharePoint search relation, but I was also curious to find out which method or methods of the SiteData web service exactly get called when the crawler runs. If you have a look at the documentation of the SiteData web service you will see that some of its methods, like GetSite, GetWeb, GetListCollection, GetList, GetListItems, etc., look like ideal candidates for traversing the SharePoint site hierarchy from the site collection level down to the list item level. The IIS logs couldn’t help me here because they don’t track the POST body of the HTTP requests, which is exactly where the XML of the SOAP request is put. So I needed a little more verbose tracking here, and I quickly came up with a somewhat ugly but working solution – I simply modified the global.asax of my test SharePoint web application like this:

<%@ Assembly Name="Microsoft.SharePoint"%>
<%@ Application Language="C#" Inherits="Microsoft.SharePoint.ApplicationRuntime.SPHttpApplication" %>
<%@ Import Namespace="System.IO" %>

<script RunAt="server">

    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        TraceUri();
    }

    protected void TraceUri()
    {
        const string path = @"c:\temp\wssiis.log";
        try
        {
            HttpRequest request = HttpContext.Current.Request;
            DateTime date = DateTime.Now;
            string httpMethod = request.HttpMethod;
            string url = request.Url.ToString();
            string soapAction = request.Headers["SoapAction"] ?? string.Empty;
            string inputStream = string.Empty;

            if (string.Compare(httpMethod, "post", true) == 0)
            {
                request.InputStream.Position = 0;
                StreamReader sr = new StreamReader(request.InputStream);
                inputStream = sr.ReadToEnd();
                request.InputStream.Position = 0;
            }

            string msg = string.Format("{0}, {1}, {2}, {3}, {4}\r\n", date, httpMethod, url, soapAction, inputStream);

            File.AppendAllText(path, msg);
        }
        catch { }
    }

</script>

The code is pretty simple – it hooks onto the BeginRequest event of the HttpApplication class, which enables it to track several pieces of useful information for every HTTP request made against the target SharePoint web application. So, apart from the date and time of the request, the requested URL and the HTTP method (GET, POST or something else), I also track the “SoapAction” HTTP header, which contains the name of the SOAP method in the case of a web service call, and the POST body of the HTTP request, which contains the XML of the SOAP request (again, in the case of a web service call). The SOAP request body contains all parameters that are passed to the web service method, so by tracking this I could have everything I wanted – the exact web service method being called and the exact values of the parameters that were being passed to it. Just a quick but important note about this code – don’t use it for anything serious; I created it only for testing and quick tracking purposes.

With this small custom tracking of mine enabled I ran a full crawl of my test web application again, and after the crawl completed I opened the log file (the tracking code writes to a plain text file in a hard-coded disc location). To my surprise I saw that only two methods of the SiteData web service were called – GetContent and GetURLSegments. The real job was obviously done by the GetContent method – there were about 30-35 calls to it, and only one call to GetURLSegments. You can see the actual trace file that I had after running the full crawl here. My test web application was very small, containing only one site collection with a single site, so the trace file is short and easy to follow. The fourth column contains something that looks like a URL address, but this is in fact the value of the “SoapAction” HTTP header – the last part of this “URL” is the actual method that was called in the SiteData web service. The fifth column contains the XML of the SOAP request that was used for the web service calls – you can see inside it the parameters that were passed to the SiteData.GetContent method.
If you check the MSDN documentation about the SiteData.GetContent method you will see that its first parameter is of type “ObjectType”, which is an enumeration. The possible values of this enumeration are: VirtualServer, ContentDatabase, SiteCollection, Site, Folder, List, ListItem, ListItemAttachments. As one can deduce from these values, the GetContent method is designed and obviously used for hierarchy traversing and metadata retrieval (the MSDN article explicitly mentions that in the yellow note box at the bottom). If you check the sample trace file from my test site again you will see that the calls made by the crawler indeed start with a call using ObjectType.VirtualServer and continue down the hierarchy with ObjectType.ContentDatabase, ObjectType.SiteCollection, etc. You may notice something interesting – after the calls with ObjectType.List there are no calls with ObjectType.ListItem. Actually, in the trace file there is only one call to GetContent using ObjectType.ListItem, and it is invoked for the list item corresponding to the home (welcome) page of the site, which in my case was a publishing page. The other method of the SiteData web service – GetURLSegments – is also called only for the home page; given the URL of the page, it basically returns the containing site and list of the corresponding list item. And if you wonder which option is used for retrieving list items – it is neither ObjectType.List nor ObjectType.ListItem. The former returns an XML fragment containing mostly the list metadata and the latter the metadata of a single list item. The option that actually returns the metadata of multiple list items is ObjectType.Folder. Even though the name is a bit misleading, this option can be used in two cases – to retrieve the files from a folder that is not in a SharePoint list or library (e.g. the root folder of a SharePoint site) or to retrieve the list items from a SharePoint list/library. If you check the sample trace file you will see that the GetContent method is not called with ObjectType.Folder for all lists – this is because the crawler is smart enough not to call it for empty lists (and most of the lists in my site were empty). The crawler knows that a particular list is empty from the preceding GetContent (ObjectType.List) call, which returns the ItemCount property of the list.
There is one other interesting thing about how the crawler uses GetContent with ObjectType.Folder – if the list contains a large number of items, the crawler doesn’t retrieve all of them with one call to GetContent but instead reads them in chunks of two thousand items each (the logic in SharePoint 2010 is even better – it determines the number of items in a batch depending on the number of fields that the items in the particular list have). And a note about the return value of the GetContent method – it is in all cases an XML document that contains the metadata for the requested object or objects. It is interesting that the XML also contains the permissions data associated with the object, which is obviously used by the indexing engine to maintain ACLs for the various items in its index, which in turn allows the query engine to apply appropriate security trimming based on the permissions of the user that issues the search query. For the purposes of this posting we are mostly interested in the result XML of the ObjectType.List and ObjectType.Folder GetContent invocations – here are two sample XML fragments from GetContent (List) and GetContent (Folder) calls. Well, indeed they seem quite … SharePoint-ish. Except for the permissions parts, GetContent (Folder) yields pretty much the same XML as the standard Lists.GetListItems web service method. Have a look at the attributes containing the field values in the list items – these start with the well-known “ows_” prefix, which is the very same prefix that we see in the crawled properties associated with SharePoint content. Another small detail to note is that the GetContent (Folder) XML is not exactly well formed – for example, it contains improperly escaped new line characters inside attribute values (not that this prevents it from rendering normally in IE) – I will come back to this point in the second part of this posting.
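
If you want to experiment with the GetContent responses yourself without running the crawler, you can call the web service directly through a proxy generated from /_vti_bin/SiteData.asmx (e.g. via “Add Web Reference” in Visual Studio). The sketch below is based on assumptions – the proxy namespace SiteDataWS, the class name SiteData and the list GUID are placeholders – and it simply loops over the same ObjectType.Folder call that the crawler uses, with the lastItemIdOnPage parameter acting as the book-mark for the next chunk of items:

using System;
using System.Net;
using SiteDataWS;   // hypothetical namespace of the generated SiteData.asmx proxy

class SiteDataContentSample
{
    static void Main()
    {
        SiteData siteData = new SiteData();
        siteData.Url = "http://myserver/_vti_bin/SiteData.asmx";      // sample URL
        siteData.Credentials = CredentialCache.DefaultCredentials;    // a crawl-capable account in real life

        string listId = "{00000000-0000-0000-0000-000000000000}";     // placeholder list GUID
        string lastItemIdOnPage = null;

        // The crawler uses ObjectType.Folder to fetch the items of a list in chunks;
        // a non-empty lastItemIdOnPage coming back means there are more items to fetch.
        do
        {
            string xml = siteData.GetContent(
                ObjectType.Folder,   // what to return – here, the items of a list
                listId,              // objectId – the ID of the list
                string.Empty,        // folderUrl
                string.Empty,        // itemId
                true,                // retrieveChildItems
                false,               // securityOnly – false returns metadata plus permissions
                ref lastItemIdOnPage);

            Console.WriteLine("Received {0} characters of item XML", xml.Length);
        }
        while (!string.IsNullOrEmpty(lastItemIdOnPage));
    }
}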

So far so good, but the results above are from a full crawl. What happens when we run an incremental crawl? Have a look at the sample trace file that I got when I ran an incremental crawl on my test web application after I had changed several list items and had created a new sub-site and several lists in it. You can see that it again contains several calls to SiteData.GetContent, one call to SiteData.GetURLSegments and this time one call to SiteData.GetChanges. If you wonder why there is only one call to SiteData.GetChanges – a quick look at the result XML of this method will explain most of it. If you open the sample XML file you will see that the XML is something like a merged document of the results of the GetContent method for all levels from “ContentDatabase” down to “ListItem” … but containing only the parts of the SharePoint hierarchy whose leaf descendants (that is, list items) got changed since the time of the last crawl. So basically, with one call the crawler can get all the changes in the entire content database … well, almost – unless there are too many changes, in which case the method is called several times, each time retrieving a certain number of changes and then continuing from the reached book-mark. If you check the documentation of the GetChanges method in MSDN you will see that its first parameter is again of type ObjectType. Unlike with the GetContent method, however, you can use it only with the “ContentDatabase” and “SiteCollection” values (the rest of the enumeration’s values are ignored, and the XML returned if you use them is the same as with the “ContentDatabase” option). And one last thing in the case of the incremental crawl – the calls to the GetContent method are made only for new site collections, sites and lists (which is to be expected). The metadata for new, updated and deleted list items in existing lists is retrieved with the call to the GetChanges method.
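
Here is a similarly hedged sketch of calling GetChanges through the same hypothetical proxy as above, using the signature documented for the 2007 SiteData web service. The content database GUID and the change token are placeholders – in practice you would start from the token your previous crawl (full or incremental) left off at – and the loop mirrors the crawler behaviour when the number of changes is large:

using System;
using System.Net;
using SiteDataWS;   // the same hypothetical SiteData.asmx proxy as above

class SiteDataChangesSample
{
    static void Main()
    {
        SiteData siteData = new SiteData();
        siteData.Url = "http://myserver/_vti_bin/SiteData.asmx";    // sample URL
        siteData.Credentials = CredentialCache.DefaultCredentials;

        string contentDatabaseId = "{00000000-0000-0000-0000-000000000000}";  // placeholder GUID
        string lastChangeId = string.Empty;     // change token from the previous crawl (placeholder)
        string currentChangeId = string.Empty;
        bool moreChanges;

        // Keep asking for changes until the service reports there are none left –
        // this is the "book-mark" behaviour described above.
        do
        {
            string changesXml = siteData.GetChanges(
                ObjectType.ContentDatabase,   // only ContentDatabase and SiteCollection are meaningful here
                contentDatabaseId,
                ref lastChangeId,
                ref currentChangeId,
                100,                          // timeout
                out moreChanges);

            Console.WriteLine("Received {0} characters of change XML", changesXml.Length);
        }
        while (moreChanges);
    }
}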

So, this was in short the mechanism of the interaction between the SharePoint Search 2007 indexing engine (the Sts3 protocol handler) and the SharePoint SiteData web service. In the second part of this posting I will continue by explaining how I hacked the SiteData web service and what the results of that hack were for the standard SharePoint search functionality.