Getting the PHI Out (while leaving colors bright and vibrant) [Part 2]

In a previous post, I described our experiences using a software package from MIT (“deid”) for automated removal of PHI from radiology impressions. The technical details of how to prepare a dataset and customize deid would have made that post lengthy (and probably boring) for anyone who wasn’t actually going to sit down and use the application right away. However, feeling the need to contribute back to the “long tail” of knowledge on the Internet, I decided a tutorial post with all the details might be useful. With that in mind, let’s roll up our sleeves and get started!


These instructions assume you’re using a Unix-based operating system such as Linux, Mac OS X, or Solaris, and that you’re reasonably comfortable using the command line. If you’re using Windows, the spirit of what I’m saying here will be useful to you, but the specific commands won’t work by default. Most of the utilities mentioned here are available as command-line tools from the GnuWin project. While they haven’t been tested on Windows, you should be able to use the example commands here as a starting point.

Also, this is not meant to be a comprehensive tutorial. The deid documentation is quite good, so I’m writing this assuming you’re already familiar with where all the files are and what is provided. I’ve tried to focus this post on the things learned from experience that aren’t spelled out explicitly in the docs.

Customizing Vocabularies

Deid uses various vocabulary lists to match words against known PHI. The nice thing about this is that it doesn’t require you to dig into source code and risk breaking its carefully validated routines. As the authors make abundantly clear, the base vocabulary lists are not comprehensive or customized enough to handle data beyond what is included in the example files. Don’t think you can skip the customization and just run deid blindly. The results will likely be unacceptable to you with the wrong things kept and removed.

Place Names

In general, my preference is to be as complete as possible when removing the names of places from a dataset. The U.S. Census publishes a comprehensive list of places in the United States that I adapted for this purpose. It contains county-level subdivisions such as towns, villages, and regions. In many cases, the Census appends the type of place to the end of a name (e.g. “Philadelphia city”). For the best accuracy, it’s wise to remove these modifiers; you’re far more likely to see a match for “Philadelphia” than for the full “Philadelphia city”. You could probably limit the list to the states nearest your location, since that’s likely to cover most of your PHI needs. However, I used the whole U.S. on the theory that it supplies some “free” common proper names that might hit things that aren’t necessarily geographic locations but that I still want removed. For example, from “Drexel, MO” we get the proper name “Drexel”, which is also the name of a university just down the street from our hospital. Obviously, this only helps you with U.S. location names. If you’re using this in another part of the world, you will likely have to find a similar data set to use.

Here’s a snippet of the data in that file:

PA4259616Petersburg borough                                                    455      193        996913             0    0.384910    0.000000 40.572417 -78.048036
PA4259672Petrolia borough                                                      218       99       1038529             0    0.400978    0.000000 41.017964 -79.718204
PA4260000Philadelphia city                                                 1517550   661958     349881748      19544394  135.090104    7.546133 39.998012 -75.144793
PA4260008Philipsburg borough                                                  3056     1527       2123711             0    0.819969    0.000000 40.897315 -78.218950
PA4260120Phoenixville borough                                                14788     6793       9296297        422687    3.589320    0.163200 40.130819 -75.519061
PA4260136Picture Rocks borough                                                 693      288       2424560             0    0.936128    0.000000 41.280066 -76.711731
PA4260264Pillow borough                                                        304      139       1240809             0    0.479079    0.000000 40.640430 -76.803464

This data needs to be processed inside a good text editor. You’ll need to:

    • Remove all the columns of numerical data at the end of each line
    • Remove the leading 9 characters of each line
    • Remove the following (trailing) place name modifiers: CCD, UT, census subarea, township, precinct, city, town, charter, grant, purchase, borough, village, district, barrio, -pueblo
    • Remove the following (leading) modifiers: Municipality of <place_name>, District <number> <name>, Precinct <number> <name>, Township <number> <name>
    • Remove the following entirely since they have no value for our purposes: County subdivisions not defined, Township No. <number>

After all of that manipulation, you should be left with a file that has a column of clean place names.
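If you’d rather script the cleanup than do it all by hand in an editor, here’s a minimal Python sketch of the same steps. The two-space column boundary and the exact modifier spellings are assumptions based on the snippet above; check them against your copy of the Census file.

```python
import re

# Trailing and leading modifiers from the list above (assumed spellings).
# Matching is deliberately case-sensitive: the lowercase Census modifier
# "city" is stripped, but a capitalized place word like "City" is kept.
TRAILING = (r"\s+(CCD|UT|census subarea|township|precinct|city|town|charter|"
            r"grant|purchase|borough|village|district|barrio)$|-pueblo$")
LEADING = r"^(Municipality of|District \d+|Precinct \d+|Township \d+)\s+"

def clean_place(line):
    """Turn one raw Census line into a bare place name, or None to drop it."""
    name = line[9:]                        # strip the leading 9-character code
    name = re.sub(r"\s{2,}.*$", "", name)  # drop numeric columns (2+ spaces onward)
    name = name.strip()
    # Entries with no value for our purposes
    if (name.startswith("County subdivisions not defined")
            or re.match(r"Township No\. \d+", name)):
        return None
    name = re.sub(TRAILING, "", name)      # trailing place-type modifiers
    name = re.sub(LEADING, "", name)       # leading modifiers
    return name.strip() or None
```

Compound modifiers (e.g. “charter township”) may need a second pass; treat this as a starting point, not a finished tool.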

For best accuracy, place names need to be separated into ambiguous and unambiguous lists. This prevents ordinary words from being removed too aggressively. To do this, I used the SCOWL (Spell Checker Oriented Word List) 7.1 corpus, which you can obtain from SourceForge. The following command uses the SCOWL corpus to generate a list of common words (excluding proper names). It needs to be run in the directory where the English word lists reside (in the 7.1 release that’s scowl-7.1/final):

cat *english*words* *british-words* variant_*-words*  > all_words.txt

The resulting “all_words.txt” file can then be used to filter the place names into ambiguous and unambiguous lists. I used a little Python script to do this, but you could easily do it with just about any programming language you want.
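My original script isn’t included here, but the logic is simple enough to sketch in Python (the function and variable names are my own; deid just needs the two resulting lists written out to files). A place name counts as ambiguous if any word in it also appears in the common-word list:

```python
def split_places(place_names, common_words):
    """Split place names into ambiguous (overlap with ordinary English
    words) and unambiguous lists, mirroring deid's two-list convention."""
    common = {w.strip().lower() for w in common_words if w.strip()}
    ambiguous, unambiguous = [], []
    for name in place_names:
        name = name.strip()
        if not name:
            continue
        # Ambiguous if any word in the place name is an ordinary English word
        if any(word.lower() in common for word in name.split()):
            ambiguous.append(name)
        else:
            unambiguous.append(name)
    return ambiguous, unambiguous
```

In practice you’d pass it `open("places.txt")` and `open("all_words.txt")` and write each returned list to its own file.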

Hospital/Medical Center Names

Unfortunately, this list poses the most trouble, since ideally you’d like a list of all hospitals in your region complete with their colloquial names (e.g. “CHOP” as well as “The Children’s Hospital of Philadelphia”). In our case, I got an internal list maintained by our institution and then used multiple runs of deid to refine and add to it. It’s far from a perfect solution, but it has worked reasonably well. The good news is that, in general, once you have this list it can be reused across projects, unless there’s a bias in your data set that favors certain hospitals over others.

Patient Names and MRNs

This category of information is the one that keeps all of us informatics folks up at night. For deid to work properly you must provide the names of patients and their medical record numbers. This includes first and last names as well as the MRN. Deid contains a comprehensive list of suffixes and prefixes, so things like “Jr.”, “Sr.”, “III”, etc. are automatically detected with nothing required on your part. One thing to be aware of is that deid has no facility for including more than one name or more than one MRN per patient. This can cause problems when parents’ or caregivers’ names are present in notes, though in practice most of those get detected by deid’s general name list. We only encountered issues when a parent had a highly unusual first name and a last name different from their child’s. The format of the file is as follows:


There are some important things to be aware of with respect to deid removing names and MRNs. Deid will always remove a patient’s name from a note, no matter the context. Other names get removed based on things like prefixes (“Dr.”, “Mr.”, “Ms”, “House, M.D.”, etc.) or matches to unambiguous name lists. In theory, always removing the patient’s name sounds like exactly what you want. However, as noted in the previous post, this can have surprising consequences. Most notably, in our system, MRIs of the head often include references to “white” matter. Given the large number of individuals with the last name “White”, this actually ends up revealing the patient’s last name, since “…the white matter can be seen…” is replaced with “the [**Patient’s Last Name **] matter can be seen…”. Fortunately, this scenario can be mitigated after the fact (see below).

Physician Names

While not strictly PHI in the HIPAA sense, physician names are still not needed in a research dataset, so making a best effort to remove them seems prudent. Deid provides specific lists for physician names. However, when we used internal lists generated from our EHR, specificity suffered: our EHR includes some “placeholder” names that can confuse deid. We found that the fallback, out-of-the-box name removal worked fairly well for physician names.

Formatting Records

Once you have your initial lists of dictionary terms, it’s time to create a file containing the records you wish to de-identify. The format is as follows (each record is wrapped in START_OF_RECORD/END_OF_RECORD delimiter lines, per the deid documentation):

START_OF_RECORD=<patient_id>||||<note_id>||||
The text of the record goes here
||||END_OF_RECORD

Here’s an example (de-identified of course!):

CLINICAL INDICATION: Headaches. CT showed large
posterior fossa mass. 
BRAIN: Sagittal 3D T1 gradient echo with axial reformations, axial
3D TSE T2, sagittal TSE T2, axial and coronal FLAIR,
post-contrast sagittal 3D T1 gradient echo with axial and coronal
reformations, post-contrast axial spin echo T1 with fat
suppression, axial diffusion weighted imaging were performed on a
3.0 Tesla system.

...Rest of the note follows....

Running Deid

Once you’ve got your data ready to go and all of your lists are customized, you’re finally ready to run the program. There is a configuration file, but it’s well documented so I won’t supply the details here. The major configuration changes I made were to disable the physician name list and make sure deid was not running in the “gold standard corpus” mode. You have to at least make the latter change since you’re running on your own data and not the gold standard data.

Here’s the command to run the program:

perl deid.pl your_notes_here deid.config

Note that you need to pass the config file in as well as the file name you’re using. One minor annoyance that caused me grief at first: the input file that contains the notes must have the extension “.text” and you cannot pass the extension on the command line. So in the example above, there would be a file called “your_notes_here.text”. If I had actually passed the full filename, I would have gotten an error. Once it’s running, you’ll have some time to kill depending on the speed of your computer and the amount of data you’re processing.

Analyzing Deid Output

Deid will produce several output files:

  • *.phi file - the locations of all PHI identified by deid, on a per-note basis
  • *.res file - the actual de-identified output
  • *.info file - a listing of words that were and were not removed, as well as the identity assigned to each. Words identified but not removed are on lines marked with a “#”

Identifying Phrases Removed Incorrectly

In general, I found the .info file to be the most useful for finding out how deid is performing. It is especially helpful in identifying words that deid thinks are names but that are not removed because they don’t meet some other criterion. Likewise, it is useful for finding things that were removed but are probably not PHI. To get the most out of it on a large data set without pulling your hair out, you’ll need to brush up on the Unix tools grep, sed, sort, and uniq. Those, combined with some hefty regular expressions, can generate the necessary lists for further analysis. As mentioned above, if you’re on Windows, there are ports of these tools you can use as well.

In general, I found the following approach useful: do an initial run, identify all the medical terms that must be white-listed, add them to the medical terms list and repeat. Once that is complete, one can then move on to finding names that are not being properly removed. The following command will give counts and a list of each unique item removed from a note, excluding all the lines that list out the patient info:

grep -v '#' your_notes_here.info | grep -v '^Patient' | sed -e 's/[0-9]*\t[0-9]*\t\([^\t]*\)\t.*$/\1/g' -e 's/[^0-9A-Za-z]$//g' | sort | uniq -i -c | sort -n | more

Whoa! That’s a monster command, let’s break it down:

grep -v '#' your_notes_here.info
Recall that the info file marks words that were identified but not removed with a “#”. This command outputs all the lines in the info file that contain items actually removed from the input file (if you drop the -v, you’ll get all the items that were identified by deid but skipped). It is followed by a pipe (“|”) to send the output to the next command.

grep -v '^Patient'
This part removes lines beginning with “Patient”. These are just informational lines that note when a new patient record is being read. Output from this command is also piped along to the next command.

sed -e 's/[0-9]*\t[0-9]*\t\([^\t]*\)\t.*$/\1/g' -e 's/[^0-9A-Za-z]$//g'
This command has a bunch going on. The first part uses the “substitute” command to look for a string of digits, followed by a tab, followed by another string of digits, followed by a tab, and then capture anything that isn’t a tab all the way up to the next tab (got that? Good.). The captured text is then used to replace all the other “stuff”. This isn’t the place for a full regular expression tutorial, so we’ll leave it at that. Basically we’re removing a whole bunch of extraneous stuff so we’re left with just the phrase that was removed. The second part of this command cleans up any junk lines left over that we’re not interested in. This is then piped along to…

sort | uniq -i -c | sort -n | more
We’ll take all these at once since they’re simple. This set of commands first sorts the output (“sort”), then we get a count of all the unique lines (“uniq -i -c”), then we sort one final time so the list is in an order we can look at.

When you look at the output, you’ll get two columns: the first is the number of times the phrase was removed, and the second is the actual phrase. A good number of the early ones will be dates. If you want to filter those out, you can do so with some more creative “grep -v” statements. In general, you want to pay attention to things being removed in large numbers in the initial phases of your analysis. But keep in mind, there are diminishing returns from repeatedly adding more things to the various lists and dictionaries. You are unlikely to get it perfect, and it could take a lot of time. You can see examples of what these tables look like in the So Does it Work? section of my previous post.

Identifying Phrases that Should be Removed

Similarly to the above, it’s possible to view the things deid found but chose not to remove using the following code:

grep '#' your_notes_here.info | grep 'Name' | sed -e 's/[0-9]*\t[0-9]*\t#\([^0-9\t]*\)\t.*$/\1/g' -e 's/[^a-zA-Z]//g' | sort | uniq -i -c | sort -n | more

I won’t dig into the specifics as much on this one since it follows a similar pattern as above. Note that this is very specifically targeting things deid thinks might be names. These are likely to be the ones that require the most scrutiny. In general, the low-frequency phrases are the ones that require further investigation in this output (things skipped at a rate of one or two instances). That’s where all your rare or unique names are likely to be. In general the list is short (~1000 lines in our case) so it is possible to visually scan and identify any potential issues. One can then go into the original notes searching for the occurrence and ascertain whether or not it’s a genuine PHI issue. In some cases, you may need to modify some of the core deid lists for names. One example we ran into was that “tse” is a name but also an imaging term. Deid had “tse” in the unambiguous names list, so we had to bump this over to the “ambiguous” list to avoid false positives. Each dataset is likely to have such peculiarities.
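If scanning ~1000 lines by eye gets tedious, the low-frequency candidates can be pulled out programmatically. Here’s a minimal Python sketch, assuming you’ve saved the output of the pipeline above to a file; the default threshold of 2 matches the one-or-two-instances rule of thumb:

```python
import re

def low_frequency_candidates(uniq_lines, max_count=2):
    """From `sort | uniq -i -c | sort -n` output, keep phrases skipped only
    a handful of times -- where the rare or unique names tend to hide."""
    candidates = []
    for line in uniq_lines:
        # uniq -c lines look like "      2 Smithfield": count, then phrase
        m = re.match(r"\s*(\d+)\s+(.+)", line)
        if m and int(m.group(1)) <= max_count:
            candidates.append(m.group(2))
    return candidates
```

In practice you’d call it with `open("skipped_names.txt")` (the saved pipeline output; the filename here is hypothetical).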

Keep in mind: the one thing deid cannot do is report the names that it did not find because they are not in any dictionary and don’t fit any pattern. There is no automated solution to this problem.

Post processing

After a few runs where you’ve updated and added to your dictionaries, you will reach a point where you’re mostly comfortable with the output. We found the output a bit too specific, especially in cases where deid actually tells you a patient’s name is being removed (e.g. the “white matter” problem mentioned above). The following set of sed substitutions will shorten and further obscure the annotations provided by deid. Note that this runs against the actual output file (ending in “.res”). We don’t need dates, even obfuscated ones, so we strip out the “fake” dates supplied by deid entirely and replace them with a simple “Date” phrase. Finally, it also replaces any long number with an “Ambiguous number” phrase. I’m hard-pressed to think of a situation where a long string of digits uninterrupted by decimal points could have any research utility. This provides a final filter on an MRN or phone number making it through the process.

sed -e 's/\[\*\*[0-9\-]*\*\*\]/\[\*\*Date\*\*\]/g' \
-e 's/\[\*\*Last Name.*\*\*\]/\[\*\*Last Name\*\*\]/g' \
-e 's/\[\*\*First Name.*\*\*\]/\[\*\*First Name\*\*\]/g' \
-e 's/\[\*\*Female First Name.*\*\*\]/\[\*\*First Name\*\*\]/g' \
-e 's/\[\*\*Male First Name.*\*\*\]/\[\*\*First Name\*\*\]/g' \
-e 's/\[\*\*Name.*\*\*\]/\[\*\*Name\*\*\]/g' \
-e 's/\[\*\*Hospital.*\*\*\]/\[\*\*Hospital\*\*\]/g' \
-e 's/\[\*\*Location.*\*\*\]/\[\*\*Location\*\*\]/g' \
-e 's/\[\*\*Medical Record Number.*\*\*\]/\[\*\*MRN\*\*\]/g' \
-e 's/\[\*\*Telephone.*\*\*\]/\[\*\*Phone\*\*\]/g' \
-e 's/\[\*\*Pager.*\*\*\]/\[\*\*Pager No\*\*\]/g' \
-e 's/\[\*\*Initials.*\*\*\]/\[\*\*Initials\*\*\]/g' \
-e 's/\[\*\*Known patient lastname.*\*\*\]/\[\*\*Last Name\*\*\]/g' \
-e 's/\[\*\*Known patient firstname.*\*\*\]/\[\*\*First Name\*\*\]/g' \
-e 's/\[\*\*Street Address.*\*\*\]/\[\*\*Street Address\*\*\]/g' \
-e 's/\[\*\*State\/Zipcode.*\*\*\]/\[\*\*State\/ZIP\*\*\]/g' \
-e 's/[0-9]\{7,\}/\[**Ambiguous Number**\]/g' \
your_file.res > your_final_output_file.txt

Concluding Thoughts

For all six (two?) of you who’ve made it this far, I hope I’ve been able to provide you with an approach to use when running deid on your own data. It is very clear to me that the program is highly sensitive to the dataset you supply. Many of the things we encountered might not have been seen with something other than imaging data. Furthermore, as PHI goes, imaging impressions are actually pretty free of it compared to other types of notes. Still, once you get a feel for the kinds of decisions the algorithm makes, you can make pretty good progress tuning it for your needs.

