
Reference Data Without the Database

Centralized Data

The data model for most organizations is built on the back of some amount of slowly changing, mostly static “reference data”. For a retail organization this might be the list of Stores. For an equipment manufacturer, it could be the list of MachineTypes. In the healthcare space, this could be the list of ICD-10, NDC, CPT, or LOINC codes.

Whenever I am architecting a new project, I try to favor solutions that reduce the amount of “undifferentiated heavy lifting”. This means I like to use AWS Lambda or containers for my services and DynamoDB, RDS, or another managed service for my database. Recently, AWS announced that they now have a serverless mode for their “OpenSearch Service”. This was intriguing to me as a backing store for the microservices that host the system of record for the previously mentioned reference data.

As I was thinking about this solution, spinning up the database was easy enough using CloudFormation. The part that was more tricky was loading it. ETL pipelines that take reference data and load it into a database are not complicated things to build, but they are annoying to have to build. My service would host full-text search capabilities for National Drug Code (NDC) lookup for a hypothetical Utilization Review company. As I was looking at the National Drug Code Directory database file, I noticed that it was only 26MB. This gave me an idea: instead of loading this into a database that is running separately, we can load it into a local embedded database and package the whole thing as a Docker container.

This solution has a lot of advantages:

  • We don’t have to build an ETL process to run in each environment. Instead, we can just make it part of the build process.
  • Every environment (dev, test, prod, etc) will always have exactly the same data.
  • There is no need to manage or pay for a separate database service.
  • To update the data, we just build and deploy a new version of the app.
  • Reindexing operations are much simpler: we just change the way the index is built.

There are a few downsides to consider:

  • An update requires a deployment, so this solution requires a CI/CD pipeline that can deploy faster than the data needs to change.
  • The dataset has to be reasonably small: a few MB is fine, but a few GB will probably make the container too big.

In order to support the full-text search, I am going to leverage Apache Lucene. This is the open source project that backs Elasticsearch, so it has a long track record for full-text search. If you are new to full-text search in general, or Lucene-based solutions specifically, the Lucene documentation has a good overview. Since Lucene is a Java library, we’re going to use Java to build our container.

  1. Initialize a new Spring Boot project using the Spring Initializr. Select “Spring for GraphQL” and “Spring Web” as the dependencies.
  2. Extract the downloaded zip.
  3. Add dependencies on Lucene Core and Lucene Queryparser.
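     For reference, the pom.xml entries might look something like this (the version number is an assumption; use whatever release is current):

     <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-core</artifactId>
         <version>9.7.0</version>
     </dependency>
     <dependency>
         <groupId>org.apache.lucene</groupId>
         <artifactId>lucene-queryparser</artifactId>
         <version>9.7.0</version>
     </dependency>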
  4. Create an NDCProduct class to represent each medication code, and an NDCDataset to hold the wrapper object in the JSON file.
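     A minimal sketch, assuming Java records and only the fields our API will return; the wrapper field name is also an assumption:

     import java.util.List;

     // Only the fields our API needs; the real NDC file has many more, and
     // @JsonProperty annotations may be needed to map its actual property names.
     public record NDCProduct(String code, String labelerName, String genericName) {}

     // The top-level wrapper object in the JSON file.
     public record NDCDataset(List<NDCProduct> results) {}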
  5. Both indexing and searching will need a Directory to work with. This tells Lucene where on disk our data is stored. We can define that as a Bean.
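     For example, in a @Configuration class (the index path is an assumption; Directory and FSDirectory come from org.apache.lucene.store):

     @Bean
     public Directory directory() throws IOException {
         // Point Lucene at a directory on the local filesystem
         return FSDirectory.open(Path.of("ndc-index"));
     }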
  6. We also need to create an Analyzer. This class is responsible for taking text, breaking it down into pieces and standardizing it so it can be searched. Both the indexing process and the searching process need access to it.
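     Lucene’s StandardAnalyzer is a reasonable default:

     @Bean
     public Analyzer analyzer() {
         // Splits text on word boundaries and lower-cases each term
         return new StandardAnalyzer();
     }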
  7. Create a class that will read the JSON file, convert each NDCProduct into a Lucene Document, and add it to an index. We also serialize the entire document under the _source field so we can easily deserialize it later. We mark this class with @Profile("index") so that it only runs when Spring is started with the “index” profile.
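     A sketch of the indexer; the ApplicationRunner approach, the local file name ndc.json, and the indexed field name are all assumptions:

     import java.io.File;
     import com.fasterxml.jackson.databind.ObjectMapper;
     import org.apache.lucene.analysis.Analyzer;
     import org.apache.lucene.document.Document;
     import org.apache.lucene.document.Field;
     import org.apache.lucene.document.StoredField;
     import org.apache.lucene.document.TextField;
     import org.apache.lucene.index.IndexWriter;
     import org.apache.lucene.index.IndexWriterConfig;
     import org.apache.lucene.store.Directory;
     import org.springframework.boot.ApplicationArguments;
     import org.springframework.boot.ApplicationRunner;
     import org.springframework.context.annotation.Profile;
     import org.springframework.stereotype.Component;

     @Component
     @Profile("index")
     public class NDCIndexer implements ApplicationRunner {
         private final Directory directory;
         private final Analyzer analyzer;
         private final ObjectMapper objectMapper;

         public NDCIndexer(Directory directory, Analyzer analyzer, ObjectMapper objectMapper) {
             this.directory = directory;
             this.analyzer = analyzer;
             this.objectMapper = objectMapper;
         }

         @Override
         public void run(ApplicationArguments args) throws Exception {
             NDCDataset dataset = objectMapper.readValue(new File("ndc.json"), NDCDataset.class);
             try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
                 for (NDCProduct product : dataset.results()) {
                     Document doc = new Document();
                     // Index the searchable text; no need to store it separately
                     doc.add(new TextField("genericName", product.genericName(), Field.Store.NO));
                     // Keep the whole product as JSON so results are easy to deserialize
                     doc.add(new StoredField("_source", objectMapper.writeValueAsString(product)));
                     writer.addDocument(doc);
                 }
             }
         }
     }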
  8. When we are indexing, we don’t want Tomcat to start up. We can create an application-index.properties file to suppress that behavior.
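     A single property does it:

     # application-index.properties
     # Run as a plain console app instead of starting the embedded web server
     spring.main.web-application-type=none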
  9. Create a script to pull down and extract the database file. That script will then run Maven with the “index” profile.
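     Something along these lines; the download URL and file names are assumptions, so check fda.gov for the current location of the NDC directory file:

     #!/usr/bin/env bash
     # load-data.sh -- fetch the NDC directory and build the Lucene index
     set -euo pipefail

     curl -sSL -o ndc.zip "https://download.open.fda.gov/drug/ndc/drug-ndc-0001-of-0001.json.zip"
     unzip -o ndc.zip
     mv drug-ndc-*.json ndc.json

     # Run the app with the "index" profile so only the indexer executes
     ./mvnw spring-boot:run -Dspring-boot.run.profiles=index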
  10. Create a GraphQL schema file to define our public API.
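     The schema only needs one query; the MedicationCode type name is my own, with fields mirroring the NDCProduct record:

     type Query {
         searchMedicationCodes(query: String!): [MedicationCode!]!
     }

     type MedicationCode {
         code: String!
         labelerName: String
         genericName: String
     }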
  11. Define an IndexSearcher bean that will act as our entry point into our index.
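     For example (DirectoryReader comes from org.apache.lucene.index):

     @Bean
     public IndexSearcher indexSearcher(Directory directory) throws IOException {
         // DirectoryReader opens a read-only, point-in-time view of the index
         return new IndexSearcher(DirectoryReader.open(directory));
     }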
  12. Create a SearchMedicationCodesController that will be responsible for executing our search.
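     A sketch of the controller; the searched field name and the result limit of 20 are assumptions:

     import java.util.ArrayList;
     import java.util.List;
     import com.fasterxml.jackson.databind.ObjectMapper;
     import org.apache.lucene.analysis.Analyzer;
     import org.apache.lucene.queryparser.classic.QueryParser;
     import org.apache.lucene.search.IndexSearcher;
     import org.apache.lucene.search.ScoreDoc;
     import org.apache.lucene.search.TopDocs;
     import org.springframework.graphql.data.method.annotation.Argument;
     import org.springframework.graphql.data.method.annotation.QueryMapping;
     import org.springframework.stereotype.Controller;

     @Controller
     public class SearchMedicationCodesController {
         private final IndexSearcher searcher;
         private final Analyzer analyzer;
         private final ObjectMapper objectMapper;

         public SearchMedicationCodesController(IndexSearcher searcher, Analyzer analyzer, ObjectMapper objectMapper) {
             this.searcher = searcher;
             this.analyzer = analyzer;
             this.objectMapper = objectMapper;
         }

         @QueryMapping
         public List<NDCProduct> searchMedicationCodes(@Argument String query) throws Exception {
             // Parse the user's text into a Lucene query against the genericName field
             org.apache.lucene.search.Query parsed = new QueryParser("genericName", analyzer).parse(query);
             TopDocs hits = searcher.search(parsed, 20);
             List<NDCProduct> results = new ArrayList<>();
             for (ScoreDoc hit : hits.scoreDocs) {
                 // Deserialize the stored _source JSON back into our record
                 String source = searcher.storedFields().document(hit.doc).get("_source");
                 results.add(objectMapper.readValue(source, NDCProduct.class));
             }
             return results;
         }
     }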
  13. After starting our application locally using ./mvnw spring-boot:run, we can execute a search for a medication by sending a request to the /graphql endpoint.
     {
         searchMedicationCodes(query: "aspirin") {
             code
             labelerName
             genericName
         }
     }
    
  14. Finally, we can wrap everything up in a Dockerfile that is capable of building our application.
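     A multi-stage build keeps the runtime image small. A sketch, assuming the load-data.sh script from step 9 and Temurin base images:

     # Build stage: fetch the data, build the index, and package the jar
     FROM eclipse-temurin:17-jdk AS build
     WORKDIR /app
     COPY . .
     # curl and unzip are needed by load-data.sh
     RUN apt-get update && apt-get install -y curl unzip
     RUN ./load-data.sh && ./mvnw -DskipTests package

     # Runtime stage: only the jar and the prebuilt index come along
     FROM eclipse-temurin:17-jre
     WORKDIR /app
     COPY --from=build /app/target/*.jar app.jar
     COPY --from=build /app/ndc-index ./ndc-index
     ENTRYPOINT ["java", "-jar", "app.jar"]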

This project is a great example of how one “blessed stack” is not the best solution for all problems. Rather than blindly applying a single pattern to our microservice because it “works”, we looked at the needs of our specific problem. Specifically, we identified that the amount of data we were dealing with was fairly small. We also observed that the data rarely changes. Because medications need to be searchable by name, we looked for a good full-text search library (Lucene) and then chose our language based on that choice. The result is a small code base that is easy to manage, easy to scale, and will allow a team to move on to other, higher-value work.

If you would like to check out this example, it is available on GitHub.
