Medical Research Data

Background: Medical research data is often de-identified and therefore does not fall under HIPAA restrictions. This is not universally true, however, and it is critical to monitor the data this system shares for potential HIPAA violations. This is not meant to imply that the Thebes infrastructure cannot be used under HIPAA rules; rather, federal regulations impose a much more stringent set of rules on patient-identifiable data, and such data may never be readily shareable across enterprises.

However, de-identified research data is already being shared in the scientific community. The largest and best-funded effort in this arena is the cancer Biomedical Informatics Grid (caBIG), sponsored and created by the National Cancer Institute (NCI) in the United States. After spending billions of dollars to create the finest collection of bioinformatics data ever generated anywhere on the planet, NCI realized that it had created wonderful data silos that could not be readily spanned. This is not a condemnation of NCI. In fact, NCI was the first major international player in the data grid arena and has contributed an incredible effort to the development of concepts, standards, and software. NCI adopted a firm policy against re-creating things that existed elsewhere. This policy elevated existing work, with NCI funding only the software needed to connect existing tools together.

One of the first things NCI did was create an enterprise vocabulary. This model has been referred to elsewhere in this paper. NCI realized early on that unless every data element in every database connected to the grid was clearly defined and published, there would never be a way to conduct queries across the disparate databases. Some concurrent efforts in Europe opted instead to place translation tools in front of each database to solve this problem. It seems clear that, whenever possible, creating an enterprise vocabulary is the more efficient way to build a data grid, particularly when working with a clean slate. Translation programs only make sense for legacy databases that would be more difficult to replace than to translate.
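
The benefit of a shared vocabulary over per-database translation can be sketched in a few lines. The following is a minimal illustration only, not a description of how the NCI vocabulary or Thebes is implemented; the term names, definitions, site names, and helper functions are all hypothetical. It assumes that each site stores records keyed by published vocabulary terms, so that a single predicate can be evaluated across every site without a translation layer.

```python
# Hypothetical shared vocabulary: term identifier -> published definition.
# The identifiers and definitions are invented for illustration only.
VOCABULARY = {
    "tumor_stage": "Stage of the primary tumor at diagnosis",
    "age_at_diagnosis": "Patient age in whole years at diagnosis",
}


def undefined_terms(columns):
    """Return the column names that are not published vocabulary terms."""
    return sorted(set(columns) - set(VOCABULARY))


def federated_query(databases, term, predicate):
    """Apply one predicate to every record, in every database, that uses the term."""
    if term not in VOCABULARY:
        raise KeyError(f"{term!r} is not a published vocabulary term")
    hits = []
    for name, rows in databases.items():
        for row in rows:
            if term in row and predicate(row[term]):
                hits.append((name, row))
    return hits


# Two independent (hypothetical) sites whose records use the same terms.
SITE_A = [{"tumor_stage": "II", "age_at_diagnosis": 54}]
SITE_B = [
    {"tumor_stage": "III", "age_at_diagnosis": 61},
    {"tumor_stage": "II", "age_at_diagnosis": 47},
]

results = federated_query(
    {"site_a": SITE_A, "site_b": SITE_B},
    "tumor_stage",
    lambda stage: stage == "II",
)
```

A check such as `undefined_terms(["tumor_stage", "dx_age"])` could be run before a new database joins the grid, flagging any column (here the invented `dx_age`) that has no published definition.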

Of course, the NCI Enterprise Vocabulary is open source and freely available, and it shows no sign of going away. Those considering creating a medical research data system would be well served to study this work, and potentially to adopt or expand it. No data grid can be sustained for long without an enterprise vocabulary.

In a top-down environment, creating a data vocabulary is no more difficult than it would be for a single database. In other cases, reaching agreement on the vocabulary may be a lengthy and rather painful process. Either way, the results are well worth the effort.

Once the vocabulary is in place, the Thebes infrastructure can be used to ease the complexities involved in the actual sharing of the research data. Resource discovery tools facilitate finding databases. Attribute-based access to databases, with appropriate policy limitations, opens the door to searching the data.
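
Attribute-based access of the kind described above can be illustrated with a small sketch. This is not the Thebes policy engine or its API; the `Rule` type, the `is_permitted` function, the attribute strings, and the table names are all assumptions made for illustration. The sketch assumes a default-deny model in which a request is granted only if the requester asserts every attribute that some matching rule requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical policy rule: who may do what to which resource."""
    resource: str         # e.g. a database table
    action: str           # "read" or "write"
    required: frozenset   # attributes the requester must assert


def is_permitted(rules, asserted, resource, action):
    """Default deny: grant only if some rule for this resource and action
    has all of its required attributes among those asserted."""
    asserted = frozenset(asserted)
    return any(
        rule.resource == resource
        and rule.action == action
        and rule.required <= asserted
        for rule in rules
    )


# Hypothetical policy: local researchers may write local samples, any
# researcher may read them, and only members of project p1 may read the
# remote collaborator's outcomes table.
RULES = [
    Rule("local.samples", "write", frozenset({"role:researcher", "org:site_a"})),
    Rule("local.samples", "read", frozenset({"role:researcher"})),
    Rule("remote.outcomes", "read", frozenset({"role:researcher", "project:p1"})),
]

local_researcher = {"role:researcher", "org:site_a"}
can_write_local = is_permitted(RULES, local_researcher, "local.samples", "write")
can_read_remote = is_permitted(RULES, local_researcher, "remote.outcomes", "read")
```

Under these assumed rules the local researcher can write to the local table but cannot read the remote one until a `project:p1` attribute is also asserted. Keeping the decision a pure function of asserted attributes and published rules means the same check can run at each participating site.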

Actors:

Developers: Most of the development effort is outside the scope of a Thebes use case document. However, it should be documented that a Thebes infrastructure combined with an enterprise vocabulary would make research databases easier to develop. Developers will know that tools exist for secure data sharing, database discovery, and policy creation and enforcement, and that any term used to identify a data point is guaranteed to be well understood and widely accepted. This will allow them to create rich databases and detailed data discovery mechanisms to service these databases.

Systems administrators: Administrators at each research facility will connect the identity provider to the local identity store, install the various databases, and connect these databases to one or more nearby resource discovery nodes. Each researcher needs custom client software that plugs into the Thebes infrastructure. Authorization to the databases will be accomplished via the Thebes plug-in for both data entry and distributed queries.

Researchers: Whether researchers are entering data or performing queries on local or distributed data, they will use a custom database access client to perform their work. One element of this client will be the Thebes plug-in, which allows the researcher to assert their attributes to gain appropriate access: for example, write access to certain local tables, and read access to other local tables and certain distributed tables. There will be cases where a researcher’s credentials will gain write access to a remote collaborator’s database, using the same sign-on.

Local Management and Senior Staff: In this model, upper management is relieved of any responsibility for creating data schemas, as the enterprise vocabulary will be an industry-wide creation rather than a local one. In theory, there will be a process of deciding whether or not to accept the vocabulary, but failure to do so would amount to intellectual disbarment from the domain-specific community. Management will continue to be involved in deciding what to share and with whom, although even some of those decisions may be beyond local control, as funding agencies wield their influence and researchers demand that certain connections be made with their peers’ institutions. Some data will have a financial value attached to it, and accounting facilities will be attached to the policy controls so that proper invoicing can take place. This system facilitates accounting for usage of both resource time and data consumption.
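
The accounting described above, metering both resource time and data consumption so that invoices can be produced, might look roughly like the following. This is only a sketch under stated assumptions and not a Thebes component: the `UsageLedger` class, the price list, the units, and the institution and resource names are all hypothetical. It assumes that each permitted access is recorded against the consuming institution, and it uses `Decimal` to avoid floating-point rounding in monetary totals.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical price list: (resource, unit) -> price per unit.
PRICES = {
    ("remote.outcomes", "row"): Decimal("0.02"),
    ("compute.cluster", "cpu_second"): Decimal("0.001"),
}


class UsageLedger:
    """Accumulates metered usage per institution, resource, and unit."""

    def __init__(self):
        # Decimal() is zero, so unseen keys start at 0.
        self._usage = defaultdict(Decimal)

    def record(self, institution, resource, unit, quantity):
        """Add a metered quantity, e.g. rows returned or CPU seconds used."""
        self._usage[(institution, resource, unit)] += Decimal(quantity)

    def invoice(self, institution):
        """Return (line items, total) for one institution."""
        lines = {}
        for (inst, resource, unit), qty in self._usage.items():
            if inst == institution:
                lines[(resource, unit)] = qty * PRICES[(resource, unit)]
        return lines, sum(lines.values(), Decimal("0"))


ledger = UsageLedger()
ledger.record("site_b", "remote.outcomes", "row", 500)          # data consumption
ledger.record("site_b", "compute.cluster", "cpu_second", 1200)  # resource time
ledger.record("site_b", "remote.outcomes", "row", 250)
lines, total = ledger.invoice("site_b")
```

In a real deployment the `record` call would presumably be made at the point where the policy decision is enforced, so that every granted request is also a billable event; that coupling is an assumption here rather than something this document specifies.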