Most people are probably aware that developing pharmaceuticals of any kind is an expensive and highly regulated endeavor, and that producing vaccines is even more challenging.
Vaccines often contain attenuated viruses or bacteria, meaning they're altered so they give you immunity but not the actual disease, and thus they have to be handled under precise conditions during development and manufacturing. Components have to be stored at specific temperatures and any slight temperature variance will mean the batch has to be discarded.
Discarded batches can amount to hundreds of millions of dollars in lost revenue, according to George Llado, VP of information technology at Merck & Co.
In the summer of 2012, Llado was seeing higher-than-usual discard rates on certain vaccines. Llado's team was looking into the causes of the low vaccine yield rates, but the investigative approach involved time-consuming spreadsheet-based analyses of data collected throughout the manufacturing process. Data sources included process-historian systems on the shop floor that tag and track each batch. Maintenance systems detail plant equipment service dates and calibration settings. Building-management systems capture air pressure, temperature, and other readings in multiple locations at each plant, sampling by the minute.
Aligning all this data from disparate sources and spotting abnormalities took months, and spreadsheet storage and memory limits meant researchers could only look at a batch or two at a time. But Jerry Megaro, Merck's director of manufacturing advanced analytics and innovation, was determined to find a better way.
By early 2013, a Merck team had begun experimenting with a massively scalable distributed relational database. But when Llado and Megaro learned that Merck Research Laboratories (MRL) could provide their team with cloud-based Hadoop compute, they decided to change course.
Built on a Hortonworks Hadoop distribution running on Amazon Web Services, MRL's Merck Data Science Platform turned out to be a better fit for the analysis because Hadoop supports a schema-on-read approach. That meant data from many disparate sources could be used for analysis without first being transformed, through time-consuming ETL processes, to conform to a rigid, predefined relational database schema. ETL, the old approach, is used to extract, transform, and load data for enterprise data warehouses.
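To make the schema-on-read idea concrete, here is a minimal sketch in plain Python rather than Hadoop. The file contents, field names, and helper functions are all hypothetical and are not taken from Merck's platform. The point is only that the raw exports are kept exactly as the source systems produced them, and each reader imposes just the structure a given analysis needs at the moment the data is read.

```python
import csv
import io
import json

# Hypothetical raw exports, stored as-is with no upfront transformation.
HISTORIAN_CSV = (
    "batch_id,step,yield_pct\n"
    "B001,fermentation,91.5\n"
    "B002,fermentation,78.0\n"
)
BUILDING_JSONL = (
    '{"batch": "B001", "room": "F1", "temp_c": "4.1"}\n'
    '{"batch": "B002", "room": "F1", "temp_c": "6.3", "pressure_pa": "101300"}\n'
)

def read_historian(raw):
    """Apply a schema to historian rows only when they are read."""
    for row in csv.DictReader(io.StringIO(raw)):
        yield {"batch_id": row["batch_id"], "yield_pct": float(row["yield_pct"])}

def read_building(raw):
    """Pick out only the fields this analysis needs; extra fields are ignored."""
    for line in raw.splitlines():
        record = json.loads(line)
        yield {"batch_id": record["batch"], "temp_c": float(record["temp_c"])}

def join_on_batch(historian_rows, building_rows):
    """Align the two sources on the shared batch identifier."""
    temps = {r["batch_id"]: r["temp_c"] for r in building_rows}
    return [dict(h, temp_c=temps.get(h["batch_id"])) for h in historian_rows]

rows = join_on_batch(read_historian(HISTORIAN_CSV), read_building(BUILDING_JSONL))
```

Note that the second building record carries an extra pressure field the first one lacks. Nothing had to be remodeled to accept it, which is the practical appeal over a fixed warehouse schema.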
"We took all of our data on one vaccine, whether from the labs or the process historians or the environmental systems, and just dropped it into a data lake," says Llado.
Megaro's team was then able to come up with conclusive answers about production yield variance within just three months. In the first month, July 2013, the team loaded the data onto a partition of the cloud-based platform, and it used MapReduce, Hive, and advanced dynamic time-warping techniques to aggregate and align the data sets around common metadata dimensions such as batch IDs, plant equipment IDs, and time stamps.
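Merck has not published the details of its alignment code, but the general idea behind dynamic time warping can be sketched briefly. The function below is a textbook dynamic-programming version written for illustration only. It compares two series of readings that may be sampled at different rates or shifted in time, and returns the cost of the best alignment between them.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]

# Hypothetical temperature traces: same shape, different sampling rate.
slow_sensor = [1.0, 2.0, 3.0]
fast_sensor = [1.0, 2.0, 2.0, 3.0]
distance = dtw_distance(slow_sensor, fast_sensor)
```

Because the second trace simply repeats one reading, the warped distance is zero, whereas a naive point-by-point comparison could not even line up the two series.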
In the second month, analysts used R-based analytics to chart and cluster every batch of the vaccine ever made on a heat map. Spotting notable patterns, the team then used R to produce investigative histograms and scatter plots, and it drilled down with Hive to explore hypotheses about the factors tied to low-yield production runs. Using an Agile development approach, the team set up daily data-exploration goals, but it could change course by that afternoon if it failed to find solid data backing up a particular hypothesis. In the third month, the team developed models, testing against the trove of historical data to prove and disprove leading theories about yield factors.
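The team did this work in R, and its actual scripts are not public. As a rough illustration of what sits underneath a clustered heat map, the Python sketch below computes a distance between every pair of batches. The batch IDs and feature values are made up for the example, and Euclidean distance is an assumed choice rather than the one Merck necessarily used.

```python
from itertools import combinations

# Hypothetical per-batch feature vectors (for example, scaled process readings).
batches = {
    "B001": [0.9, 0.1, 0.4],
    "B002": [0.8, 0.2, 0.5],
    "B003": [0.1, 0.9, 0.7],
}

def euclidean(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def pairwise_distances(features):
    """Compare every batch with every other batch exactly once."""
    return {
        (a, b): euclidean(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }

distances = pairwise_distances(batches)
closest_pair = min(distances, key=distances.get)
```

With n batches this produces n(n-1)/2 comparisons, so the number of pairs grows quickly as the production history gets longer. That growth is one reason a spreadsheet limited to a batch or two at a time could not support this kind of analysis.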
Through 15 billion calculations and more than 5.5 million batch-to-batch comparisons, Merck discovered that certain characteristics in the fermentation phase of vaccine production were closely tied to yield in a final purification step. "That was pretty powerful, and we came up with a model that demonstrated quantifiably that specific fermentation performance traits are very important to yield," says Megaro.
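Merck's model itself is not described here, so the following is only a simplified stand-in for the kind of relationship it quantified. The sketch computes a Pearson correlation between a single fermentation measurement and final yield. The numbers are synthetic and the variable names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Synthetic example: one fermentation trait per batch and that batch's final yield.
fermentation_trait = [1.0, 2.0, 3.0, 4.0]
final_yield_pct = [70.0, 75.0, 83.0, 90.0]
r = pearson(fermentation_trait, final_yield_pct)
```

A coefficient near 1 or -1 would suggest the trait moves closely with yield. A production model would need to account for many traits at once and for the difference between correlation and cause.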
The good news is that these fermentation traits can be controlled, and according to Merck, the new data analysis approach marks a huge advance in ensuring efficient manufacturing and a more plentiful supply of vaccines.
Big data techniques adopted by smart and open-minded teams of researchers are showing the way to reducing discarded vaccine components, increasing production, lowering overall costs, and making vaccines more available for children all over the world.