Content crawling in Office SharePoint fails
Problem
Office SharePoint fails to fully crawl a content source that contains Excel files with think-cell links, and the following message appears in the crawl log:
Error in the Site Data Web Service. (Invalid high surrogate character (0xXXXX). A high surrogate character must have a value from range (0xD800 - 0xDBFF).)
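The surrogate rule quoted in this message can be illustrated with a short Python sketch. It is only an illustration of the UTF-16 pairing rule, not the check that SharePoint itself performs. The function name find_invalid_surrogates and the idea of passing the text as a list of 16-bit code units are assumptions made for this example.

```python
# Illustrative sketch only: the UTF-16 surrogate pairing rule,
# not SharePoint's actual validation code.
HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trailing) surrogates


def find_invalid_surrogates(units):
    """Return (index, code_unit) for every unpaired surrogate.

    `units` is a sequence of 16-bit integers (UTF-16 code units).
    A high surrogate is valid only if a low surrogate follows it;
    a low surrogate is valid only as the second half of such a pair.
    """
    problems = []
    i = 0
    while i < len(units):
        unit = units[i]
        if HIGH_START <= unit <= HIGH_END:
            has_partner = (
                i + 1 < len(units)
                and LOW_START <= units[i + 1] <= LOW_END
            )
            if has_partner:
                i += 2  # skip the complete pair
                continue
            problems.append((i, unit))  # high surrogate without partner
        elif LOW_START <= unit <= LOW_END:
            problems.append((i, unit))  # low surrogate without leader
        i += 1
    return problems
```

For example, find_invalid_surrogates([0x0041, 0xDE00]) reports the code unit at index 1, because a low surrogate appears without a preceding high surrogate. Garbage property values can contain such stray code units.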
Cause
A bug in Excel 2000 and Excel XP causes these versions to generate Excel files with corrupt metadata. The problem occurs when a custom document property of type string with a linked source is added to an Excel document and the source cannot be resolved. Later versions of Excel set the document property value to something valid (e.g., an empty string). In Excel 2000 and Excel XP, however, the value contains garbage, which may cause the Office SharePoint crawler to fail. The Excel documentation explicitly states that the document property value is set to a default value before being updated when the source is resolved, so this behavior is a bug in Excel 2000 and Excel XP.
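As a rough illustration of the difference between a valid default and a garbage value, the following Python sketch falls back to a default whenever a string cannot be encoded as well-formed Unicode. It assumes the property value has already been read into a Python string by some other means. The function name safe_property_value is hypothetical and is not part of Excel, SharePoint, or think-cell.

```python
# Minimal sketch, assuming the property value is already available as a
# Python string. safe_property_value is a hypothetical helper.
def safe_property_value(raw, default=""):
    """Return `raw` if it is well-formed Unicode text, else `default`.

    Strict UTF-8 encoding raises UnicodeEncodeError for lone surrogate
    code points, which is one form that garbage string data can take.
    """
    try:
        raw.encode("utf-8")  # strict by default
    except UnicodeEncodeError:
        return default
    return raw
```

The empty-string default mirrors the valid value that later Excel versions write. The sketch only detects malformed Unicode and cannot recognize garbage that happens to be well-formed text.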
The problem can be reproduced using the following steps:
- Download the following very simple Excel file: LinkSourceProp.xls.
- Open the file in Excel 2000 or Excel XP, ensuring that macros are enabled.
- Press Alt+F11 to open the Visual Basic Editor and run the AddDocumentProperty routine.
- Go to File > Properties and select the Custom tab.
- Observe that the value of the newly added TestProperty is garbage.
Solution
think-cell uses custom document properties. After noticing this behavior, we changed our code to add our document properties with type Boolean rather than string. Both Excel 2000 and Excel XP set such a document property to a valid Boolean value, and this value remains valid if the link source cannot be resolved.
Files created using think-cell 5.0 and higher use this workaround and should be crawled successfully by Office SharePoint.
Please contact Microsoft Office Support directly for advice about repairing corrupt document property values in files generated by Excel 2000 or Excel XP.