Getting the PHI Out (while leaving colors bright and vibrant) [Part 1]

One of our larger projects right now is a database-backed web application for pediatric hearing research called AudGenDB. Part of the work there involves carefully extracting and cleaning data from the electronic health record (EHR). I could write a whole post on this process alone, but one very important step is scrubbing protected health information (PHI). When using clinical data for research purposes, all PHI needs to be removed from the data, not only because that’s the ethically proper thing to do, but also because it’s a requirement of HIPAA and Institutional Review Boards (IRBs).

In our specific case, we had some free text radiology impressions that needed to be scanned and have PHI removed. Our definition of “some” in this case is 18,314 records with a whopping 900,000 lines of text. Ugh. One bright spot was that, as medical records go, radiology impressions are relatively free of most kinds of PHI. Of course, the downside is that because PHI is so rare, you might easily miss the one record in 5,000 that has a patient medical record number in it. Furthermore, in addition to removing identifiers, we also need to ensure that the documents aren’t so heavily redacted that they are no longer useful for research purposes.

Choosing the Right Tool:  To the Literature!

Fortunately, we’re not the only people with this problem, and in fact there is a wealth of research into automated de-identification of text. In particular, this review article [1] gives quite a good survey of the current state of things. Reading through it though, my initial optimism was dimmed somewhat when I got to Table 1. My requirements were (I thought) pretty simple:

  1. Open source (or at least publicly available)
  2. Comprehensive coverage of all the major kinds of PHI
  3. Relative ease of use and deployment using a reasonably common programming language/architecture.

Scanning the list, there were few that seemed to meet the criteria. Fortunately, the deid program from MIT [2] hosted on PhysioNet [3] seemed to be a good fit. The documentation is pretty extensive and there is a nice publication that describes their approach and validation process. Briefly, deid uses simple regular expression pattern matching combined with various word lists to identify PHI. The pattern matching was attractive to me because I really didn’t want to be playing around with training sets and all the validation that needs to be done when training machine learning algorithms. In theory, this means deid could probably work on anything you give it right out of the box.
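To make the "regular expressions plus word lists" idea concrete, here is a minimal Python sketch of that general technique. It is not deid's code: the word lists, the two patterns, the `scrub` function and the tag labels are all made up for illustration, and a real tool uses far larger lists and many more rules.

```python
import re

# Hypothetical, tiny stand-ins for the much larger word lists a real tool uses.
LAST_NAMES = {"smith", "nguyen"}
PLACE_NAMES = {"philadelphia"}

# Illustrative patterns only (US-style dates and phone numbers).
PATTERNS = {
    "Date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "Phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def scrub(text):
    """Replace word-list hits and pattern hits with bracketed tags."""

    def replace_word(match):
        word = match.group(0)
        if word.lower() in LAST_NAMES:
            return "[** Last Name **]"
        if word.lower() in PLACE_NAMES:
            return "[** Location **]"
        return word

    # Pass 1: look up every alphabetic word in the word lists.
    text = re.sub(r"[A-Za-z]+", replace_word, text)
    # Pass 2: apply the regular expressions for structured identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[** {label} **]", text)
    return text


print(scrub("Seen 3/14/2011 by Dr. Smith in Philadelphia, call 215-555-0100."))
```

Even in a toy like this, the quality of the result depends almost entirely on what is in the word lists, which is why so much of the work described below is list customization.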

I should take this opportunity to note that while I refer to the MIT package here as “deid”, there is a commercial software package from DE-ID Data Corp that goes by a similar name. The software I am calling “deid” here is not the same as the commercial solution. 

Customizing Deid

A quick first pass with the vanilla deid made it clear that lots of customization was going to be needed. Fortunately, since deid uses various vocabulary lists to work its de-identification magic, there is no need to dig into source code and risk breaking the validated code. The base vocabulary lists are just not comprehensive enough or customized enough to handle data beyond what they ship in the example files. To be fair, this is something the authors make abundantly clear in the documentation. Qualitatively scanning the output showed a good number of both false positives and false negatives for our radiology data set. I’ll save the specifics of how one customizes the vocabularies used by deid for a follow-up post, but briefly here are the highlights:

Place names

I used place names from the entire United States by extracting them from US Census lists. These provide a wealth of “free” proper names that show up in other contexts (like school and university names for example).

Hospital and Medical Center Names

I would still ideally like a comprehensive list of all US hospitals, but CHOP has an internal list of the common medical centers for the area, which I used to bootstrap a more comprehensive list. Most hospital names get flagged anyway based on other proper names (e.g. “Johns Hopkins”) or place names (e.g. “Temple University Hospital”). One difficulty with hospitals is that their often unwieldy proper names cause busy clinicians to abbreviate them in the health record. Catching the abbreviations and common typos (or typos of the abbreviations!) can be tricky but it’s doable.

First and Last Names

Deid actually has a pretty comprehensive list of common first and last names based again on US Census data. There will always be cases where a unique name needs to be removed. Deid does require a list of medical record numbers and patient names so that a patient’s name is always removed from an individual record, but this is not always enough. In a pediatric population, parent names are very frequently included in notes and may not always match the child’s last name. So in the rare case that a parent’s last name differs from the child’s and the parent’s first name is something very unusual, you might have a PHI leak.

Physician and Nurse Names:

In theory it’s possible to use these, but in practice most names are caught using the generic deid name lists mentioned above anyway. I found that because of the size of the CHOP Care Network, adding these increased the false positive rate too much, so I left them out.

Medical Terms:

By far the biggest issue was making sure legitimate medical terms weren’t removed. In practice there are diminishing returns to trying to catch all of these, and they are highly dependent on the dataset (see below).

Why, Hello Mister Brain!

One of the most perplexing findings early on was that I was seeing deid throw out the word “brain” quite frequently. Unfortunately, “brain” is in the pre-defined list of ambiguous common US last names (who knew?). More unfortunately, our radiology impressions often abbreviate “MRI” as “MR”. Given that most of our impressions are from studies of the head, we have a disproportionate number of descriptions that have “MR Brain” somewhere. This leads to what I call the “Mister Brain problem”: Deid sees “MR” as a title preceding an ambiguous last name and then flags the ambiguous last name (“Brain” in this example) as a real last name because of the context. Adding “MR Brain” to a whitelist of medical terms helps, but doesn’t seem to eliminate the problem entirely (why this is so remains a mystery to me). After a brief exchange with the authors, it seems there is no real way to eliminate this problem short of modifying the source code. I’m loath to do this since it could have unpredictable consequences and I would probably need to re-evaluate the accuracy of the program on the test data set.
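The context rule behind the Mister Brain problem can be illustrated with a small Python sketch. This is my own toy reconstruction of the behavior described above, not deid's actual logic: the `TITLES`, `AMBIGUOUS_LAST_NAMES` and `MEDICAL_PHRASES` sets and the `flag_names` function are hypothetical. In particular, the whitelist check here works perfectly, which is more than I can say for what I observed with the real thing.

```python
import re

# Hypothetical subsets of the kinds of lists involved.
TITLES = {"mr", "mrs", "ms", "dr"}
AMBIGUOUS_LAST_NAMES = {"brain", "robin", "willis"}
MEDICAL_PHRASES = {("mr", "brain")}  # two-word whitelist entries


def flag_names(text, use_whitelist=False):
    """Return ambiguous last names that directly follow a title-like word."""
    tokens = re.findall(r"[A-Za-z]+", text)
    flagged = []
    # Walk over adjacent word pairs: (previous word, current word).
    for prev, word in zip(tokens, tokens[1:]):
        pair = (prev.lower(), word.lower())
        if use_whitelist and pair in MEDICAL_PHRASES:
            continue  # known medical phrase, leave it alone
        if pair[0] in TITLES and pair[1] in AMBIGUOUS_LAST_NAMES:
            flagged.append(word)  # the "title" tips the ambiguous word into PHI
    return flagged


report = "MR Brain without contrast: the brain is normal."
print(flag_names(report))
print(flag_names(report, use_whitelist=True))
```

Note that only the “Brain” right after “MR” is at risk; the second “brain” follows “the” and is left alone, which matches the pattern of removals we saw.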

Creating PHI From Thin Air

An issue related to the Mister Brain problem is that deid can unintentionally reveal someone’s last name! Here is how it happens: In addition to the standard list of common first and last names that ship with deid, you must provide it with a list of first and last names matched to medical record numbers (MRNs). For a given record, if the patient’s first or last name appears anywhere, it’s always removed and replaced with something like “[** Patient Last Name **]”.

This is pretty useful for showing the context when a name is removed, but it’s a killer when deid gets the context wrong. For example, let’s suppose you actually do have a patient with the last name of Brain. Now each occurrence of the quite innocuous word “brain” in the report will become [** Patient Last Name **]. From the context, it will be easy for the reader to see the missing word is “brain” and infer the identity of the patient. For example:

MR of [** Patient Last Name **] and spine: The [** Patient Last Name **] and skull are visible…

Not good! The solution I went with was to post-process the output to remove some of the context so that it’s not clear that it’s the patient’s name being removed. There are enough false positives sprinkled around in the results that this pretty much obscures this situation. This is obviously a sub-optimal approach, but the situation happens rarely enough in our data set that it’s a pretty minor concern. I should take this opportunity to point out that I have deliberately made up this example. The real cases we have seen use different words and last names, but have a similar pattern to the one above. Even though the real last names involved are very common, I felt it best to use a fake example to protect our patients’ privacy.
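As a rough idea of what that kind of post-processing can look like, here is a minimal Python sketch. It assumes the tags have the “[** ... **]” shape shown above and collapses every name-like tag into one generic label so that a patient-name hit looks the same as any other name hit. The `soften_tags` function and the generic “[** Name **]” label are illustrative choices, not necessarily what our pipeline uses.

```python
import re

# Assumed tag shape: "[** Some Label **]".
TAG = re.compile(r"\[\*\*\s*(.*?)\s*\*\*\]")


def soften_tags(text):
    """Collapse every name-like tag into one generic tag; keep other tags."""

    def replace(match):
        label = match.group(1)
        if "name" in label.lower():
            return "[** Name **]"
        return match.group(0)  # leave dates, phones, etc. untouched

    return TAG.sub(replace, text)


line = "MR of [** Patient Last Name **] and spine, read by Dr. [** Last Name **] on [** Date **]"
print(soften_tags(line))
```

This does not stop a determined reader from guessing the missing word from context. It only removes the hint that the word is specifically the patient’s own name, which is the part that makes the leak identifying.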

So does it work?

After all the various customizations to lists and tweaks, we found that deid performed very well. After multiple iterations, we have a high degree of confidence that PHI has been scrubbed at least as well as a human could do it. We are working with our IRB to put in place a systematic review to quantify the success rate and report any problems over time (we just need to find some summer interns who want to read 18,000+ radiology impressions).

Without any customizations for medical terms, here is a sampling of the top terms removed because deid thought they might be names:

Number of Times   Word Removed
87                demonstrates
100               CHOP
116               monroe
123               examination
139               of
214               brain
273               Robin
278               monro
1419              willis
6838              tse

Note that “tse” is an imaging abbreviation (“Turbo Spin Echo”), but “tse” was categorized by deid as an “unambiguous last name” so it was always removed no matter what. Adding “TSE” as a medical term and then reclassifying it as an ambiguous last name fixes this issue. A similar issue exists with “Willis” (“Circle of Willis”) and “Robin” (“Virchow-Robin space”). Finally there is “Foramen of Monro”, which is spelled incorrectly (“Monroe”) nearly 1/3 of the time. Most of the others are variations of the Mister Brain problem. Again, none of the above data are actual patient names.

After fixing some of those issues we have the following (actual physician names removed):

Number of Times   Word Removed
..                ..
32                Physician_Name1
37                angiogram
42                brain
47                Physician_Name2
64                Physician_Name3
65                Physician_Name4
81                r/o
87                demonstrates
100               CHOP
123               examination

This looks much better. Not perfect certainly, but when you consider the number of notes these are very small numbers. There are also a good number of physician names automatically removed. Given the sheer number of notes, redaction of 123 instances of “examination” is just not significant (note that “brain” is incorrectly removed only 42 times despite the word occurring over 21,000 times in notes).

Below is the final breakdown of all the kinds of PHI removed from the notes. The counts are per line, so multiple instances on the same line count as 1 (that was easiest to do with our good friend “grep”). Note that there are 880,711 lines in the current file.

Lines Containing One or More Instances   Percent of Lines with Instance   Type
24583                                    2.79                             Date
1758                                     0.2                              Last Name
1141                                     0.13                             First Name
159                                      0.02                             Name
185                                      0.02                             Hospital
173                                      0.02                             Location
9                                        0.001                            MRN* (*see also Ambig. Num. below)
419                                      0.05                             Phone
29                                       0.003                            Pager No.
40                                       0.005                            Initials
4                                        0.0005                           Street Address
1                                        0.0001                           State/ZIP
26                                       0.003                            Ambiguous Number (nearly all MRN)
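The per-line counting used for the table above can be reproduced in a few lines of Python instead of grep. This is a sketch under the assumption that every removal is marked with a “[** Label **]” style tag. The `lines_per_tag_type` function is a name I made up here, and the labels in the sample data are only examples.

```python
import re
from collections import Counter

# Assumed tag shape: "[** Some Label **]".
TAG = re.compile(r"\[\*\*\s*(.*?)\s*\*\*\]")


def lines_per_tag_type(lines):
    """Map each tag label to (lines containing it, percent of all lines)."""
    counts = Counter()
    total = 0
    for line in lines:
        total += 1
        # set() makes several tags of one type on a line count once.
        for label in set(TAG.findall(line)):
            counts[label] += 1
    return {label: (n, 100.0 * n / total) for label, n in counts.items()}


sample = [
    "Compared with [** Date **] and [** Date **].",
    "Discussed with Dr. [** Last Name **] on [** Date **].",
    "No acute intracranial abnormality.",
    "Normal study.",
]
for label, (n, pct) in sorted(lines_per_tag_type(sample).items()):
    print(label, n, pct)
```

To run it over a real output file, pass an open file handle as `lines`, since iterating over a file yields one line at a time.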

There were a number of instances where deid failed to locate the MRN and we had to filter them in post-processing.
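A post-processing filter for stray MRNs can be quite simple. The Python sketch below is one possible approach, with loud assumptions: it treats any standalone run of exactly eight digits as an MRN (the real format at your institution may well differ), and it also removes any MRN strings you already know belong to the record. The `scrub_mrns` function, its parameters and the “[** MRN **]” tag are hypothetical.

```python
import re

MRN_TAG = "[** MRN **]"


def scrub_mrns(text, known_mrns=(), mrn_digits=8):
    """Replace known MRNs and MRN-shaped digit runs with a tag."""
    # Exact matches first: MRNs already known for this record.
    for mrn in known_mrns:
        text = re.sub(r"(?<!\d)" + re.escape(mrn) + r"(?!\d)", MRN_TAG, text)
    # Then anything with the assumed MRN shape: exactly mrn_digits digits,
    # not embedded in a longer number.
    shape = re.compile(r"(?<!\d)\d{" + str(mrn_digits) + r"}(?!\d)")
    return shape.sub(MRN_TAG, text)


print(scrub_mrns("Pt 12345678, prior study 2009, ref 0042", known_mrns=["0042"]))
```

A shape-based rule like this will also catch other numbers of the same length, so it trades a few extra false positives for fewer missed MRNs.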

How to Customize Deid for Yourself

I will detail the steps for customizing deid and talk more about post-processing and analyzing the output in a future post. Stay tuned!


  1. Meystre S, Friedlin F, South B, Shen S, and Samore M. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Medical Research Methodology, 2010, 10:70. doi:10.1186/1471-2288-10-70
  2. Neamatullah I, Douglass M, Lehman LH, Reisner A, Villarroel M, Long WJ, Szolovits P, Moody GB, Mark RG, Clifford GD. Automated De-Identification of Free-Text Medical Records. BMC Medical Informatics and Decision Making, 2008, 8:32. doi:10.1186/1472-6947-8-32
  3. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation, 2000 (June 13), 101(23):e215-e220.