Wednesday, April 4, 2012

Census 2010 Blocks in SQL Server 2008 Geography Spatial Type


Loading census data into SQL Server’s Geography Spatial Type should have been very straightforward.  Simply download the shapefiles from the census ftp site in batch mode, unzip, then import to a SDE feature class with the geometry storage type set as “GEOGRAPHY”. 

Unfortunately, I was not that lucky.  During the import process with ArcCatalog, the 8+ million blocks failed after loading around 2.5 million records with a variation of the following error:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "geography":
System.ArgumentException: 24200: The specified input does not represent a valid geography instance.
System.ArgumentException:
   at Microsoft.SqlServer.Types.SqlGeography.ConstructGeographyFromUserInput(GeoData g, Int32 srid)
   at Microsoft.SqlServer.Types.SqlGeography.GeographyFromText(OpenGisType type, SqlChars taggedText, Int32 srid)
.

Translation: There was some feature in the data that had a valid shape according to esri, but not Microsoft. 

There are some differences between how esri and Microsoft want the polygon shapes to be drawn, specifically with the order in which the vertices are drawn (clockwise vs counterclockwise) for the exterior (outer boundary) ring(s) and interior (donut hole) ring(s). The first attempt was to create a button in ArcMap which read the geometry from the SDEBINARY geodatabase, created the Well Known Text and then geography for each feature (accounting for the correct draw orientation), and then pushed the results into a SQL Server table using ADO.net.   

The geometry creation code looks something like this:

Dim geo As New SqlGeometry()
geo = Microsoft.SqlServer.Types.SqlGeometry.STGeomFromText(New SqlChars(New SqlString(myshapecalc)), 4326).MakeValid
mynewrow("Shape") = geo.MakeValid

I was fairly happy with my first results, until I remembered I wasn’t dealing with multipart geometry, empty geometry, multiple interior rings, and other geo quirks.  After some quick modifications, the code was done and the results were populated, however I still hadnt solved the problem of getting my data into GEOGRAPHY.  When I modified the code to pump GEOGRAPHY not GEOMETRY, I had the same failure.  Did a great job of solving the wrong problem – at least now I have a esri to SQL Geometry table loader. 

I was definitely surprised to realize that you can in fact push unprojected data into the geometry data type, using the SRID from a geographic coordinate system.  I am not really sure why this is not an acceptable approach, other than geography is typically for Lat / Long coordinates, and geometry is for feet / meters units.  There are definitely stronger limitations on what can be pushed into each – geography being much more stringent.  One we realized this, I pushed all the polygons (WGS 84 projection) into a geometry dataset using ArcCatalog, then tried to convert the data to geography from there. 

One recommendation we found was to use the geometry STUnion and MakeValid functions to try to fix the shape’s validity from a geography perspective, which can be seen here:

geography::STGeomFromText(shape.MakeValid().STUnion(shape.STStartPoint()).STAsText(),4326)

This is taking the source polygon, running the MakeValid() operation, then unioning the polygon with it’s first vertex, and finally converting the result to geography.  Doing so re-orders the vertices so that they are drawn in the correct order. 

Using this query, I was able to start pumping Geography records, unfortunately we still found that the conversion failed. 
At this point, I needed to isolate the record or records that were causing the process to fail, so I created this query:

declare @objectid int

Declare rowcursor CURSOR FAST_FORWARD FOR

select objectid
from dbo.CENSUS_2010_TABBLOCK_GEOMETRY_IMPORT2
order by objectid

OPEN rowcursor
FETCH NEXT FROM rowcursor
INTO @objectid

WHILE @@FETCH_STATUS = 0
BEGIN

declare @sqlinsert varchar(max)
set @sqlinsert = '
insert into
dbo.CENSUS_2010_TABBLOCK_GEOGRAPHY
select objectid, blkidfp00,
geography::STGeomFromText(shape.MakeValid().STUnion(shape.STStartPoint()).STAsText(),4326)
from
dbo.CENSUS_2010_TABBLOCK_GEOMETRY_IMPORT2
where objectid = ' + convert(varchar(255),@objectid)

begin try
exec(@sqlinsert)
end try
begin catch
      insert into geography_failures
      select @objectid

end catch

FETCH NEXT FROM rowcursor
       INTO @objectid
END

CLOSE rowcursor
DEALLOCATE rowcursor

Here, I am taking all input records, iteration thru each, attempting to push the input row into a geography table, and where that doesn’t work, dumping the result into an error table.  This was very slow to run (1 day 21 hours), but it ended up showing me the feature that was causing the problem:

POLYGON ((-73.539727170912329 41.134362625800748, -73.534610028565623 41.142645381690613, -73.538288255665464 41.136691869856634, -73.539808109894977 41.1357116088447, -73.540023947182021 41.134416585122494, -73.53999696752112 41.134416585122494, -73.539727170912329 41.134362625800748))

On a map, a pretty standard polygon with a pretty ugly and unneeded set of vertices shooting off in one direction. 


Unfortunately, there is not really a clean way of handling this shape, as far as I can determine, other than to manipulate the shape coordinates, and manually load this single feature into the table.  

In the end, we have learned there are some issues with the census 2010 geometry, including gaps, overlaps, and funky topological discrepancies.  We also learned that it can be a major hassle with significant roadblocks to try to ditch esri formats when performing spatial operations inside the SQL database.  Finally, we re-learned during QA that esri handles cluster tolerance and spatial resolution, whereas SQL does not.  In this instance, esri was giving bad results along some polygon borders, whereas SQL was giving correct polygon association every time. 

But, this effort is definitely worth it.  With proper spatial indexing and other logic, I was able to assign census blocks to 128 million input records in about 6 hours, which is way faster than any other esri options.  And we have some very handy reusable code for the future.