- Earlier this year, Partners HealthCare gave the industry a glimpse of its ambitious plans to create a combination big data storage warehouse, research workbench, and analytics environment, based on data lake technology, that could support research into precision medicine and the benefits of the Internet of Things (IoT).
The project, called the Integrated Data Environment for Analytics (IDEA) platform, is the Boston-based healthcare system’s bid to provide a secure, trusted, customizable epicenter for the advanced research and health IT development that takes place within its many prominent divisions, including the Center for Connected Health and Massachusetts General Hospital (MGH) Cancer Center.
With the goal of fostering innovative big data analytics without requiring each project to build its own back-end infrastructure from square one, the IDEA platform intends to accelerate the pace of discovery with flexible architecture and near-limitless capacity to expand and grow with the needs of its users.
“What we’re trying to enable is research,” Brent Richter, Associate Director of IS Operations and Director of Enterprise Research IS, said to HealthITAnalytics.com in March. “There is a lot of talent in the Boston area, and thousands of researchers working on similar projects. It doesn’t make sense to have them all constantly reinventing the wheel.”
“Instead of taking an app-by-app approach like some other organizations, we are building a platform that provides a foundation for different groups to leverage in a secure and private way.”
At the time, the IDEA project was in its early phases, with development running parallel to Partners’ $1.2 billion effort to roll out an electronic health record system from Epic across its member organizations, including MGH, Brigham and Women’s Hospital, and Spaulding Rehabilitation Network.
These big changes haven’t disrupted the progress made by Richter and Dave Dimond, Chief Technology Officer for EMC’s Global Healthcare Business. In fact, the short summer months have brought an explosion of interest and the first promising pilots to the IDEA environment, including a foray into further developing the healthcare Internet of Things.
“The core data lake is really becoming an advanced clinical research information system to accelerate the speed to knowledge,” Dimond said.
“It’s very powerful technology that is a key part of retooling how providers are engaging patients at the point of care, for example, and how they are building a bridge between them. It’s about creating a learning health system – that’s really at the heart of this project.”
Rethinking the possibilities of storing big data
Data lakes differ from traditional data warehousing technologies by storing vast quantities of raw information – structured, unstructured, and everything in between – in its native formats in one centralized repository. Some implementations add a semantic layer or graph database on top to link related records, but the defining feature is that structure is applied to the data when it is read and analyzed, not when it is stored.
Instead of curating each individual dataset for a specific purpose and storing it only in that customized format, the data lake allows analysts to take an ad hoc approach: choosing which data to use, reusing it when desired, and combining disparate sources of information in ways that may not have been imagined when that data was collected.
Data lakes may be especially suited to the healthcare environment due to the highly variable nature of data generated by electronic health records, insurance claims, medical devices, Internet of Things tools, and emerging competencies in genomics.
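The schema-on-read pattern described above can be sketched in a few lines. The keys, payloads, and field names below are purely hypothetical, but they illustrate the core idea: heterogeneous raw data lands in its native format, and structure is imposed only at analysis time.

```python
import csv
import io
import json

# A toy "data lake": raw payloads kept in their native formats,
# keyed by path-like names. No schema is imposed at write time.
# All names and values here are hypothetical, for illustration only.
lake = {
    "ehr/visit_001.json": '{"patient": "p1", "dx": "I10", "sbp": 148}',
    "claims/2016_q1.csv": "patient,code,amount\np1,99213,110.00\np2,93000,75.50",
    "devices/tracker_p1.log": "2016-08-01T07:00 steps=4210",
}

def read_json(key):
    """Apply structure at read time (schema-on-read), not at ingest."""
    return json.loads(lake[key])

def read_csv(key):
    return list(csv.DictReader(io.StringIO(lake[key])))

# Ad hoc analysis combining sources that were never curated together:
visit = read_json("ehr/visit_001.json")
claims = read_csv("claims/2016_q1.csv")
spend = sum(float(r["amount"]) for r in claims if r["patient"] == visit["patient"])
print(visit["patient"], spend)  # p1 110.0
```

Nothing about the stored payloads was designed for this cross-source query; the parsing logic lives with the analysis, which is what lets the same raw data be reused for questions that did not exist when it was collected.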
The platform consists of two main components, Richter explained. One is focused on providing storage capabilities for the enormous datasets involved in translational research, precision medicine, and patient-generated health data, while the other is an adaptable analytics sandbox for developing decision support tools and conducting investigations into clinical questions.
“When it comes to storage, the key is flexibility and keeping data somewhere that has an eye on the future,” Richter said. “We’ve set up storage services with EMC to leverage elastic cloud storage (ECS), which is an object-based storage system.”
“This allows flexibility for cloud-native applications, which is important for all our users, including radiology groups, pathology groups – there’s clinical sequencing and research sequencing data in there, too – all these use cases can take advantage of the storage services.”
The modular structure of the storage capabilities means the capacity is “basically unlimited,” he continued. “We’re starting with about a petabyte right now, and we can easily scale that to four petabytes based on future needs, but the theoretical expansion is in the hundreds of petabytes.”
“We want to develop scalable solutions, which will allow us to tailor the system for different approaches. We don’t have very long procurement cycles with EMC, so we don’t always have to spec and plan extremely far in advance,” he said. “We like to work in a modular fashion, which lets us build new capacity without adding prohibitive costs. If we need to make changes or expand our capabilities, we can have it on the floor and plugged in within a few weeks. That’s how quickly we can make it happen.”
Crafting a community around the Internet of Things
Speed isn’t just important when free gigabytes start getting hard to find. It is also key for supporting IDEA users with complex research and big data analytics projects on their minds.
In addition to the MGH Cancer Center, which is using the platform to store data and integrate it with additional clinical datasets, the Center for Connected Health at Partners is taking advantage of the opportunity to advance its work with the growing Internet of Things.
“We’ve carved out virtual machine (VM) environments for them, and they are starting to put data in there for their initiatives,” Richter said. “This includes a lot of Internet of Things data from activity trackers that will be used for research and also for voluntary connected health studies.”
“They are organizing that data within the IDEA environment and using MongoDB, then they are starting to build applications on top of that for analytics and visualization of the data. Going forward, they are going to use that platform as the base for all their development.”
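To make the MongoDB approach concrete, the kind of summarization the Center might run over activity-tracker documents can be sketched as follows. The document shape and field names are assumptions, not Partners' actual schema; the pure-Python function mirrors what a MongoDB `$group` aggregation would compute.

```python
from collections import defaultdict

# Hypothetical activity-tracker documents, shaped like what might be
# stored in a MongoDB collection (field names are assumptions).
readings = [
    {"patient": "p1", "date": "2016-08-01", "steps": 4210},
    {"patient": "p1", "date": "2016-08-01", "steps": 3100},
    {"patient": "p1", "date": "2016-08-02", "steps": 6050},
    {"patient": "p2", "date": "2016-08-01", "steps": 880},
]

# In MongoDB this would be a $group aggregation, e.g.:
#   db.readings.aggregate([{"$group": {
#       "_id": {"patient": "$patient", "date": "$date"},
#       "steps": {"$sum": "$steps"}}}])
# The pure-Python equivalent:
def daily_totals(docs):
    totals = defaultdict(int)
    for d in docs:
        totals[(d["patient"], d["date"])] += d["steps"]
    return dict(totals)

print(daily_totals(readings))
```

Summaries like these are the kind of derived data that visualization and analytics applications can then be built on top of.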
Not many organizations have cracked the problem of how to centralize the collection of Internet of Things data, but the data lake approach may be ideal for holding the messy, unstandardized datasets generated by consumer-grade and medical-grade devices.
“There is a tremendous amount of interest around the world in the Internet of Things – the Internet of Healthcare, really,” said Dimond. “That’s the tagline for Joe Kvedar at the Center for Connected Health. But all that data will require a landing zone, and I think we’ve sort of decided that can’t be straight into the electronic health record.”
“Because you can collect all this patient-generated health data from all these sources, but you’re not necessarily sure if you’re going to need that data or not. And we’re already struggling with trying to figure out what is clinically relevant and what is extraneous at the moment.”
If the IDEA platform can become a secure, consolidated collection point for IoT data, “we can start to prove some of the tools to do analytics on the data, and enable specific research projects, but still keep that huge pool of data around for future uses,” said Dimond.
Enabling translational research and precision medicine
Translational research is also a prime target for the Partners community. At the Brigham and Women’s Hospital Emergency Department, investigators are looking into how big data can be converted into new care guidelines and clinical decision support.
“They are at an early stage with this, but we have also carved up a VM for them, and they are using the APIs of a few GE platforms to collect telemetry data from devices like EKG machines,” Richter explained.
“Normally, that data is kept around for a couple of weeks, maybe, while the patient is in the hospital, but it’s only there for clinical use. Once that use case has run its course when the patient is discharged, it is basically just overwritten.”
But if this data is moved to the data lake for longer-term storage, it could be used for research purposes even when the clinical question is no longer top priority, he continued.
“They’ve gotten some very good results with using clinical decision support to understand the pattern of medications that might be ordered when a patient shows up in the ED, for example, and figuring out how to use that data to start looking at developing care guidelines. They have also been able to look at predicting length of stay from admission data.”
“Within the IDEA platform, they have set up the environment so they can start ingesting all of that data for research purposes. Once they have done that, it’ll translate very quickly to decision support and predictive analytics for the emergency department, which is a very exciting concept.”
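A length-of-stay predictor built from admission data, as described above, could start as simply as a baseline over historical cases. The records, fields, and values below are illustrative assumptions only, sketched to show the shape of the idea rather than the team's actual model.

```python
from statistics import mean

# Hypothetical historical ED admissions pulled from the data lake
# (fields and values are illustrative, not an actual schema).
history = [
    {"complaint": "chest pain", "age": 64, "los_hours": 30},
    {"complaint": "chest pain", "age": 58, "los_hours": 26},
    {"complaint": "laceration", "age": 23, "los_hours": 4},
]

def predict_los(complaint, records):
    """Baseline length-of-stay estimate: mean LOS of prior cases with
    the same chief complaint. A real model would add telemetry-derived
    features (EKG trends, vitals) once those persist in the lake."""
    matches = [r["los_hours"] for r in records if r["complaint"] == complaint]
    return mean(matches) if matches else None

print(predict_los("chest pain", history))
```

The point of moving telemetry into long-term storage is precisely that richer features than chief complaint and age become available to models like this, instead of being overwritten at discharge.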
The partnership between machine learning and data lake technology will be the key to getting “really tangible, clinical knowledge out of resources that are previously untapped,” Dimond added.
In turn, that will allow researchers – and clinicians at the point of care, eventually – to access deeply personalized insights for individuals with complex needs.
“When you think about precision medicine, what you really want is an online shopping cart experience,” said Dimond. “You want to see a box that says ‘researchers like me are looking into this’ or ‘patients like her have tried this therapy with these results.’ It’s about personalization, just like what the big retailers are doing.”
“And you want to be able to team up, collaborate, and crowdsource so you can move the whole community faster. To do that, you need to create platforms that offer data as a service. Then you can start to create residual information from your initial analytics, and make that available to researchers, too. You can build apps and data products on top of it.”
“That’s how we’ll get to precision medicine,” he asserted. “By creating enablement platforms that will drive research without requiring every single team to build everything from scratch each time they have a question they want to answer.”
Reaching that goal will require a multifaceted approach to personalized care, which will likely include patient-generated health data from Internet of Things devices alongside genomic sequencing results, clinical data from EHRs, historical population health data, and additional big data sources that have yet to be imagined.
Looking towards the future of data-as-a-service
Data-as-a-service platforms like IDEA will streamline the process of generating actionable insights, which will make it easier for researchers to crunch the numbers, report on their findings, and take the next step into the clinical setting.
“Not only does IDEA offer enough storage that is already integrated with the tools, but they can very easily add on new capabilities if they come up with new use cases, whether they want to use a certain vendor, take an open source approach, or develop it themselves,” said Richter.
Datasets can be stored in the system, allowing other researchers to access and reuse previously collected information and insights. Combined with available data from places like the National Library of Medicine and the GenBank genetic sequence database at the National Institutes of Health, researchers will be able to utilize all the data they need within a single, familiar environment.
“We’re not building separate environments in a traditional IT manner for each customer,” Richter said. “The platform is there. The security is wrapped around it. The privacy is there. But if there are datasets that can be shared or need to be shared, they can either make them public or name specific users that are allowed to access that data.”
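The sharing model Richter describes, where datasets are private by default but can be made public or shared with named users, amounts to a simple access-control rule. The dataset names and structure below are illustrative assumptions, not the platform's actual implementation.

```python
# Minimal sketch of the sharing model described above: each dataset
# is private to its owner unless marked public or explicitly shared
# with named users. Names and structure are illustrative assumptions.
datasets = {
    "mgh/tumor_panel": {"owner": "alice", "public": False, "shared_with": {"bob"}},
    "cch/step_counts": {"owner": "carol", "public": True, "shared_with": set()},
}

def can_access(user, name):
    ds = datasets[name]
    return ds["public"] or user == ds["owner"] or user in ds["shared_with"]

print(can_access("bob", "mgh/tumor_panel"))   # True (named user)
print(can_access("dave", "mgh/tumor_panel"))  # False (private)
print(can_access("dave", "cch/step_counts"))  # True (public)
```

Because the security and privacy wrappers live in the platform rather than in each project's environment, adding a collaborator is a permissions change instead of a new IT build-out.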
That means more time for study and less time spent on project planning, Dimond said, while fostering a better user experience.
“One of the most important things about treating data like a service is that you get into this environment where it starts to be more automated, and the end users and the research groups aren’t being asked what they need,” he said. “It’s there in advance of demand. It’s particularly important to make sure our analytics users have a very positive experience, and that we can support complex graphical computing and high-speed analytics without those really frustrating delays.”
The next six months are likely to see just as much rapid growth and development as the initial launch period, Richter added, and IDEA will start to produce even more meaningful results.
“I’m very excited about where we’ll be in the next six to twelve months,” he said. “Back in March, we were in development of the platform with just a couple of very early pilot projects. Now we have a lot of our departments and groups developing their own applications, learning about the range of capabilities available to them, and using our storage functions.”
“A year from now, we hope to have results from that. Those groups will have their applications. We’ll have the products. We’ll have the analytics pipelines set up. And we won’t just be doing actual research, but we’ll have the results of that research to feed back into the data lake and contribute to future efforts. So we are all very excited about that.”