Why does Clear.Dental (ab)use git for its database?

This page will attempt to explain the most controversial decision that was made for Clear.Dental: using git together with plain text files to store all of its data. This is by far the most common question I get when people learn about this project. In order to properly explain this decision, we have to go over my (Dr. Shah’s) philosophy, my experience in software and dentistry, and what my goals are with Clear.Dental.

My (Dr. Shah’s) Philosophy

I have been using Open Source Software since 2004. For me, the most important aspect of using open source software is the ability to understand how the software works and the ability to make changes when needed. If a project is open source, but nobody can understand how the code works, then the fact that it is open source is rather moot, since nobody else could change the code in a meaningful way.

Normally, open source tends to benefit only the software engineers who develop and edit the software, rather than end users. However, I think this philosophy should be extended as much as possible to end users. I am not expecting end users to be able to write code for the software they use, but they should have a good understanding of how it works; that understanding gives end users more agency in diagnosing problems with their computer.

Therefore, whenever I was faced with a design decision, I tended to push for the simpler design, simply because I want to be able to explain that decision to an end user as well.

My Experience in Software

Back in undergrad, many of my friends would ask me SQL-related questions since I had already taken a course that covered it. These were computer science majors who understood the fundamental principles of computer science better than I did and had no issue working in different programming paradigms (like learning Java, then C, then ML); but they still had a lot of trouble with SQL. It made me not only appreciate the work that database engineers do, but also realize that managing any kind of relational database is a very complex task. If I had trouble explaining the difference between a left join and a right join to a CS major, what chance do I have of explaining the basics to doctors? And that is without getting into vendor-specific issues like stored procedures and optimizing queries and insertions.

My Experience in Dentistry

I've worked (as a dentist) in just about every kind of dental office out there: big corporate-run dental offices, ones run by the state government, community health clinics, and small-town single-doctor clinics (I am now running my own dental practice).

One thing I noticed when working as an associate dentist in these locations is that their “IT person” is always working at some other location or for some other customer. Because of this, whenever these practices have an IT issue, it is often a day or so before the technician can come in and resolve it. If it is a critical issue (like not being able to connect to the database), the doctor is forced to either reschedule the appointments or write everything down on paper until the system is back up. Lots of these practices look at me, with my CS and IT background, and ask “Hey! Can you fix our problem?”. The answer had to be “no”; not because I didn’t want to, but because fixing connection and authentication issues is by no means trivial, and the last thing I want to do is make the database even worse than it was before. It is very rare for somebody with a regular IT background to walk into a dental office and fix these issues unless they have special knowledge of how Dentrix or Eaglesoft works behind the scenes. So I decided that I should make a system that is dead simple for anybody to understand and takes very little training to know how to fix.

My Goal in all of this

I want doctors to get more involved in how their software works. I believe that having a simple filesystem as a database would actively encourage doctors to see what is happening behind the scenes of the EHR that they are using. Here is one example:

[
	{
		"AllergyName": "Penicillin",
		"AllergyReaction": "Severe"
	},
	{
		"AllergyName": "Nickel",
		"AllergyReaction": "Mild"
	}
]

I ask you this: how long do you think it would take for you to teach a doctor what this means? How long do you think it would take to teach a doctor to edit this kind of file? Compare that to how long it would take you to explain to any doctor the basics of a relational database (or any other kind of database). For me at least, allowing any end user doctor to see and edit the data directly is paramount.
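To make the point concrete, here is a minimal sketch (in Python, purely for illustration; it is not necessarily what the Clear.Dental UI layer actually uses) of how a program might read a file like the one above. The file name allergies.json is an assumption made for this example:

import json

# Read the patient's allergy list straight from the plain text file on disk.
with open("allergies.json", "r", encoding="utf-8") as f:
    allergies = json.load(f)

# Each entry is a simple name/reaction pair, exactly as shown above.
for entry in allergies:
    print(entry["AllergyName"], "-", entry["AllergyReaction"])

The file stays readable and editable by a human with any text editor, which is the whole point.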

Overview of the solution (see more information here)

  • Each "trusted" computer has a git repo. Each operatory has a trusted computer. It is recommended (but not required) to have a “hub” computer to act like a pseudo server. However, it simply holds the “bare” repo and its primary purpose is to make it easier for all other trusted computers to pull the updates from a single source.
  • The git repo has the entire patient database along with information about the providers and the practice itself.
  • For the UI layer, all data I/O is done on the local disk. This ensures that even image files load essentially instantly.
  • Each change to any file is followed up with a git commit, along with a commit message (generated by the UI layer) explaining why the file was changed (see the sketch after this list).
  • This change is immediately pushed. If a central hub computer is used, the commit is only pushed there. If a strictly decentralized setup is used, the commit is pushed to each of the other trusted computers.
  • All of the pulling and pushing is done over the ssh protocol (which git supports out of the box). Each computer uses an RSA key pair to log in to the other computers, which allows all the updates to happen transparently, without manual authentication.
  • Every minute (via a cron job), each computer pulls whatever updates are available. This ensures that whatever is shown in the UI layer is at most one minute old. A git pull can also be forced before the one-minute mark.
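As a rough sketch of the commit-push-pull cycle described above (written in Python purely for illustration, not the actual Clear.Dental implementation; the repo path, commit message, and remote configuration are all assumptions):

import subprocess

REPO = "/path/to/patient-repo"  # hypothetical path to this trusted computer's working copy

def run_git(*args):
    # Run a git command inside the repo; raise an error if it fails.
    subprocess.run(["git", "-C", REPO, *args], check=True)

def record_change(message):
    # Stage whatever the UI layer just wrote, commit it with a message
    # explaining why the file was changed, and push it immediately.
    run_git("add", "-A")
    run_git("commit", "-m", message)
    # With a hub computer, this pushes to the hub's bare repo; in a strictly
    # decentralized setup this step would loop over every peer remote instead.
    run_git("push")

def periodic_pull():
    # Meant to be called from a cron job roughly once a minute, so the local
    # copy is never more than about a minute behind.
    run_git("pull", "--ff-only")

record_change("Updated allergy list after reviewing medical history")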

Common Lies I have seen in the dental technology world

Lie #1: Putting all the data in the cloud resolves all IT issues

The general trend right now is for doctors to put all of their patient information on a 3rd party “cloud” service. These services give no option to self-host, make it difficult for doctors to transfer records from one service to another, and force the practice to keep paying the monthly subscription or risk breaking their local state law.

Lie #2: The network never goes down

The network always finds a way to go down. Yes, most of the time you can just reset the router and that will resolve the issue. But it will not always resolve the issue. For example, if the internet service provider goes down and the practice relies on a pure cloud platform, there is very little the practice can do to get access to the patients’ records. Worse yet, if the service itself goes down, no amount of internet connectivity will save the practice; they would have to reschedule all their patients for the day until the service comes back up. Cloud-based vendors will claim that this never happens. This is a lie. It has happened to me, and I had to face this exact issue. There should always be a backup in case something goes down. With Clear.Dental, even if the GUI breaks down, the text files will always be editable.

Lie #3: You need a high end server with massive amounts of storage to store and serve the data

What is patient data? There are really three main categories of data: text data (case notes, medical history, etc.), images (radiographs, scans, etc.), and 3D data.

The first category, text, takes up very little space. Each case note is about 1 kilobyte. Let’s say a practice has 5 providers, and each provider sees 10 patients a day and works 260 days a year. We are looking at around 13 MB per year. This is a trivial amount of data.
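The arithmetic behind that estimate, spelled out as a quick sketch:

providers = 5
patients_per_day = 10
working_days_per_year = 260
note_size_kb = 1  # roughly one kilobyte per case note

total_kb = providers * patients_per_day * working_days_per_year * note_size_kb
print(total_kb / 1000, "MB of case notes per year")  # -> 13.0 MB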

Second is images. Each intraoral radiograph (stored with lossless compression) is about 1 megabyte. Assuming that you take an FMX (18 radiographs) on each new patient and then 4 bitewings in each subsequent year, and that the practice gets 100 new patients each month (which is an extremely large number for a single location), the practice would need about 26 gigabytes of new storage per year. At first this does appear to be a large amount of data to store on each computer, but take into consideration that the price of storage generally gets cheaper each year, and that this would be an extremely profitable practice that does not have any cash flow issues. There is no reason why a normal SSD cannot hold all this information. And because this is part of a normal filesystem, tricks like LVM cache can be deployed with mechanical hard drives alongside SSDs.
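Here is that estimate spelled out; the assumption about how many existing patients return for recall bitewings is mine, added only to show how the numbers land in the right ballpark:

new_patients_per_month = 100
fmx_mb = 18          # 18 intraoral radiographs at roughly 1 MB each
bitewings_mb = 4     # 4 bitewings at roughly 1 MB each

new_patients_per_year = new_patients_per_month * 12      # 1,200 patients
fmx_total_mb = new_patients_per_year * fmx_mb            # 21,600 MB
# Assume (my assumption) that roughly the previous year's 1,200 patients
# come back for their recall bitewings.
recall_total_mb = new_patients_per_year * bitewings_mb   # 4,800 MB

print((fmx_total_mb + recall_total_mb) / 1000, "GB per year")  # -> ~26.4 GB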

As for the third category, 3D data normally has to be copied to the workstation computer anyway. Most patients do not need a full CBCT at each visit, so this category takes up a minimal amount of storage.

Lie #4: Storing data in text files makes any kind of search very inefficient

For Linux at least, the file system is rather efficient at finding a file and reading it. Parsing .json and .ini files is also relatively fast now. Just to give an example: on an AMD Ryzen 7 1800X, using a single-threaded for-loop that goes through 165 patients, reads each of their ledgers (parsing the .json file), and adds up the balance for each patient (including write-offs) to determine which ones have an outstanding balance, the system takes a grand total of 6 milliseconds to do all of this. Even when testing a situation where the practice has 20,000 patients (each with a long ledger), it did not take more than 300 milliseconds to generate the same report. There are separate files for each of the attributes of each patient (a different file for the treatment plan than for the hard tissue chart), and these files do not take long to read.
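As an illustration of the shape of that report, here is a rough Python sketch. The directory layout, file name, and field names are assumptions for the example, not the actual Clear.Dental schema, but the single-threaded loop is the same idea:

import json
from pathlib import Path

PATIENTS_DIR = Path("/path/to/repo/patients")  # hypothetical directory layout

def patient_balance(ledger_path):
    # Assume each ledger entry carries a signed "Amount" field, so payments
    # and write-offs are simply negative amounts. The real field names may differ.
    entries = json.loads(ledger_path.read_text(encoding="utf-8"))
    return sum(entry.get("Amount", 0) for entry in entries)

# Single-threaded loop over every patient directory, as in the benchmark above.
outstanding = {}
for patient_dir in sorted(PATIENTS_DIR.iterdir()):
    ledger = patient_dir / "ledger.json"   # hypothetical file name
    if ledger.is_file():
        balance = patient_balance(ledger)
        if balance > 0:
            outstanding[patient_dir.name] = balance

for patient, balance in outstanding.items():
    print(patient, balance)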

Interesting side effects of all of this

  • By having git keep track of who made what changes, it not only adds accountability in the practice, it also supports HIPAA compliance by providing an audit trail of how the changes were made over time.
  • Since a git pull or push only sends the changes (rather than the whole repo), it becomes very easy to create an incremental backup system on an offsite server.
  • Because the .git directory holds all the changes, it technically keeps an extra copy of the entire repo. That way, if somebody mistakenly deletes data, there is a way to get it back. It also means the storage requirement is essentially double the normal size of the filesystem contents. And in case of a disk failure, the other computers each hold a full copy that can be used as a backup. (A sketch of querying the audit trail and recovering a deleted file follows this list.)
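Here is a sketch of what the audit trail and the recovery look like in practice. It simply wraps ordinary git commands in Python for illustration; the repo path and file path shown are hypothetical:

import subprocess

REPO = "/path/to/patient-repo"  # hypothetical path

def git(*args):
    # Run a git command in the repo and return its output as text.
    result = subprocess.run(["git", "-C", REPO, *args],
                            check=True, capture_output=True, text=True)
    return result.stdout

# Audit trail: who changed a given file, when, and (via the commit messages) why.
print(git("log", "--follow", "--", "patients/0001/allergies.json"))

# Recovery: restore a file that was deleted or mangled in the working tree,
# using the copy from the most recent commit. (If the deletion itself was
# already committed, you would check the file out from an earlier commit instead.)
git("checkout", "HEAD", "--", "patients/0001/allergies.json")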