
2012-05-15

How to eat legislative sausage


Here's a two-part post, after a very long absence from the blog. I have been busy with several other projects, but I am gearing up to participate in the "legal hacks" event (see http://legalhacks.org) very soon, and as a result, am revisiting some issues related to organizing code-like legal materials. 

The first part of this post discusses organizing materials once they have been obtained, and the second part discusses some of the challenges in obtaining them. A bit backwards, perhaps, but the first part is likely more relevant to the upcoming event.

Part I: Organizing code-like legal materials, or, eating legislative sausage

To extend the old adage that making laws is like making sausage: organizing laws after the fact is very much like eating the sausage, and below is my general method for doing that, for whatever it's worth. As always, I welcome any feedback, questions, or suggestions.

I read with great interest two recent blog posts by Thomas Bruce of the Legal Information Institute at Cornell and Grant Vergottini of legix.info about challenges in organizing legislative/legal data using different types of identifying information. Generally, Mr. Bruce's post describes the functions that identifying information can serve. These are summarized below:

a) “Unique naming”, i.e., assigning a specific name to a legal provision within a system

b) “Navigational reference”, which is similar to navigating a filesystem

c) “Retrieval hook/container label”, i.e., to use a citation as a placeholder to aggregate lower-level content that is stored in other locations/records

d) “Thread tag/associative marker”, i.e., grouping of related documents in “threads”; one example he uses is a “captive search” URI, but in my view, this is mainly another way to get at a retrieval hook

e) “Process milestone”, i.e., inferring some meaning from the official status of a document, e.g., if a bill has been assigned a Public Law number, it has presumably been enacted into law.

f) “Proxy for provenance”, e.g., the existence of a bill number means that legislation has been officially noticed in some way.

g) “Popular names, professional terms of art, and other vernacular uses”, e.g., the Social Security Act, the Stark Law, the Anti-Kickback Statute (to use some of the examples with which I am most familiar).

Mr. Vergottini goes into the issues surrounding selecting frameworks to be used to actually implement those kinds of identifiers, e.g., via a URN or URL-based system, and discusses some of the difficulties inherent in selecting and implementing a system to capture relevant data in a machine-readable way. He also identifies problems with viewing different portions of text, as well as tracking text that gets amended or redesignated.
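To make the distinction between those framework styles concrete, the same provision might be identified either by a name or by an address. The snippet below is a hypothetical sketch only: neither string follows the actual URN:lex proposal or any official scheme, and the domain is invented.

```python
# One provision, two hypothetical identifier renderings. Shapes are
# illustrative only -- not real URN:lex syntax or any official scheme.
parts = ("us", "usc", "42", "1320a-7b", "b", "3", "H")

# Name-based (URN-style): a location-independent name for the provision.
urn = "urn:lex:" + ":".join(parts)

# Address-based (URL-style): a resolvable location in a publisher's taxonomy.
url = "https://example.gov/code/" + "/".join(parts)
```

The trade-off Mr. Vergottini wrestles with is visible even here: the name survives if the document moves between publishers, while the address resolves directly but binds the identifier to one host's taxonomy.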

Common problems Messrs. Bruce and Vergottini both discuss include documents/provisions with identical names/identifiers in an official classification system (e.g., the two subparagraphs 42 U.S.C. § 1320a-7b(b)(3)(H) that coexisted for seven years until fixed by Pub. L. 111-148 § 3301) and the question of how to store temporally different versions of text.

I started building the ontolawgy™ platform (a web-based legal analysis system) about six years ago for my regulatory practice, and I ran into the problems discussed above quite early. Here are some of the approaches I have taken to address them:
  1. Treat every textual division as a unique document, and allow it to be accessed via a unique URL based on its location in the government taxonomy (a - c in Mr. Bruce's overview).
  2. Store each descriptive element about that document in a tag/field. This includes official and unofficial “popular names” (e.g., the Social Security Act), section numbers within those popular names, section numbers of the U.S. Code, Public Law enacting provisions, etc. (c - g)
  3. Allow users to query on any of those elements. (a - g)
  4. Track duplicates and give them distinct records that are still retrieved in an appropriate way using their descriptive tags/fields. (a - d, g)
  5. Track each provision using its current designation, but maintain a full locative, temporal, and ontological history within the record and the system. (a - e, g) For example, 42 U.S.C. § 1320a-7b(b)(3)(I) used to be the second 42 U.S.C. § 1320a-7b(b)(3)(H) that was enacted by Pub. L. 108-173 § 431 (the first subparagraph (H) was enacted by § 237 of the same Public Law); the system tracks all that information and allows users to query it and, e.g., gather together all historical versions of subparagraph (H) to track how it has changed over time.
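The five approaches above could be sketched roughly as follows. This is a minimal, hypothetical model in Python; the class and field names are my own inventions for illustration, not the actual ontolawgy™ schema.

```python
from dataclasses import dataclass, field

@dataclass
class Provision:
    """One textual division stored as its own record (approach 1)."""
    current_designation: str                                 # e.g. "42 U.S.C. § 1320a-7b(b)(3)(I)"
    path: tuple                                              # location in the government taxonomy
    popular_names: list = field(default_factory=list)        # approach 2
    enacting_provisions: list = field(default_factory=list)  # approach 2
    prior_designations: list = field(default_factory=list)   # approach 5

class Index:
    """A tiny in-memory index: query on any descriptive element
    (approach 3) while duplicates remain distinct records (approach 4)."""
    def __init__(self):
        self.records = []

    def add(self, provision):
        self.records.append(provision)

    def query(self, **criteria):
        def matches(p):
            for name, wanted in criteria.items():
                have = getattr(p, name)
                if isinstance(have, list):
                    if wanted not in have:
                        return False
                elif have != wanted:
                    return False
            return True
        return [p for p in self.records if matches(p)]

idx = Index()
# The redesignated subparagraph keeps its full history (approach 5):
idx.add(Provision(
    current_designation="42 U.S.C. § 1320a-7b(b)(3)(I)",
    path=("42", "1320a-7b", "b", "3", "I"),
    enacting_provisions=["Pub. L. 108-173 § 431"],
    prior_designations=["42 U.S.C. § 1320a-7b(b)(3)(H) (second)"],
))
idx.add(Provision(
    current_designation="42 U.S.C. § 1320a-7b(b)(3)(H)",
    path=("42", "1320a-7b", "b", "3", "H"),
    enacting_provisions=["Pub. L. 108-173 § 237"],
))

# The former second "(H)" is still reachable through its history:
hits = idx.query(prior_designations="42 U.S.C. § 1320a-7b(b)(3)(H) (second)")
```

The point of the sketch is that no single "primary key" is privileged: every descriptive element is just another queryable field, so the same record answers to its U.S. Code citation, its popular name, its enacting Public Law, or its former designation.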
As for the mechanics: when I started building the system, my main goal was to get up and running quickly with a free, open-source, off-the-shelf system. The system I chose is extremely flexible, has a very active development community, and still works quite well. While it does not currently use any sort of (proposed) standard like URN:lex or Akoma Ntoso, it does use inline markup, and thus should be easily convertible to a legal markup standard once one is in place.

I can't go into much more detail here, but please contact me to get access to my demo system if you would like to see it in action.

Part II: Obtaining legal source materials, or, how the government makes sausage even messier

All that said, one significant challenge I still face is getting rational raw data from official sources. Indentation can be highly relevant semantically, depending on the subject matter, but official sources either just do away with indentation altogether (I'm looking at you, Code of Federal Regulations) or present it in such an inconsistent format that it might as well not be there (U.S. Code).

Back to the sausage. Essentially, we pay the government to make legal sausage, cook the sausage, and serve it to us, but just before they serve it, they mash it up, smear it around the plate, then take away our silverware and tie our hands behind our backs. I spend much more time than should be necessary simply ensuring that the materials I work with are properly indented to accurately reflect their meaning. I've written several small programs to do about 95% of the work, but that remaining 5% can be almost maddening, particularly when dealing with multiple levels of unenumerated flush text. The materials are certainly drafted with visible indentation (take a look at Public Laws: all the indentation is there and correct), but all this useful information gets stripped out at some point in the publication process, and it is not at all clear to me why this happens. The U.S. Code uses “bell codes” for typesetting print documents, but this doesn't excuse the lack of indentation in electronic publications.
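The kind of marker-driven re-indentation those small programs perform might be sketched like this. This is a minimal hypothetical sketch, not my actual code, and the sample text is invented; note how the genuinely ambiguous cases fall through.

```python
import re

# Enumeration markers in the usual federal drafting order, shallowest first.
# "(i)" is genuinely ambiguous -- subsection (i) vs. clause (i) -- and flush
# (unenumerated) text has no marker at all; both belong to the stubborn
# remainder that still needs a human eye.
LEVELS = [
    re.compile(r"^\([a-z]\)"),    # (a) subsection    -> depth 1
    re.compile(r"^\(\d+\)"),      # (1) paragraph     -> depth 2
    re.compile(r"^\([A-Z]\)"),    # (A) subparagraph  -> depth 3
    re.compile(r"^\([ivxl]+\)"),  # (ii) clause       -> depth 4
]

def depth_of(line):
    """Guess a line's depth from its leading enumeration marker, if any."""
    text = line.strip()
    for depth, pattern in enumerate(LEVELS, start=1):
        if pattern.match(text):
            return depth
    return None  # flush text: inherit the current depth (or flag for review)

def reindent(lines, width=2):
    """Restore indentation that the official publication stripped out."""
    out, current = [], 0
    for line in lines:
        depth = depth_of(line)
        if depth is not None:
            current = depth
        out.append(" " * (width * current) + line.strip())
    return out

# Invented sample text, flattened the way official sources deliver it:
flat = [
    "(a) In general",
    "(1) A first-level paragraph;",
    "(A) a subparagraph under it; and",
    "(ii) a clause under that.",
]
restored = reindent(flat)
```

A rule table like this handles the easy bulk of the work; the maddening remainder is exactly the cases the table cannot decide, such as telling subsection (i) from clause (i), or working out which depth a run of flush text should inherit.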

The C.F.R. is even more maddening: This document claims that the XML format of the C.F.R. “is a complete and faithful representation of the Code of Federal Regulations, which matches most closely to the author's original intent... [and] fully describes the structure of the Code of Federal Regulations, including the large structure (chapters, parts, sections, etc.), the document structure (paragraphs, etc.), and semantic structure”, then goes on to explain that the SGML indentation markers for subsections, paragraphs, subparagraphs, clauses, etc. have all been collapsed into the same single tag. This means that every last bit of indentation/separation (except for line breaks) within each section—and “sections” can be very long and complex, with multiple nested levels of semantically relevant indentation—has been completely stripped from all publicly available electronic materials. How is this supposed to help the public?

The LII has generally addressed indentation issues in its publication of the U.S. Code (see http://www.law.cornell.edu/uscode/text/42/1395ww for an example), and content is freely available for viewing and non-commercial re-publication.

LII's new Code of Federal Regulations (CFR) system, the result of a close collaboration with the government, also does an excellent job of organizing and indenting CFR data the way it was meant to be read: it is the only freely-available resource of which I am aware that does this.

While the LII's sites offer a valuable public service, they do not solve the underlying problem: Properly indented content is not freely available to the public from the government for commercial re-use, even though these government works are in the public domain. Why is this a problem? Because official platitudes notwithstanding, government publications significantly obscure or corrupt the intended meaning and scope of the laws that govern us.

If anyone has some insight about how to get the government to bring useful and accurate indentation to its official publications, please get in touch; I would be thrilled to work with you to help make this happen.


© 2012 Alex M. Hendler. All Rights Reserved.