(Huge thanks to for the video!)

The Future of Healthcare: Big Data, Predictive Analysis, and Machine Learning.

Jeff Pennington is the Senior Director of Translational Informatics at Children’s Hospital of Philadelphia (CHOP). Jeff and his associates work in the Department of Biomedical and Health Informatics, a 60-person entrepreneurial tech organization within CHOP whose goal is to impact child health using technology.

Jeff’s goal for this presentation was to demystify CHOP as an institution by talking through some of the groundbreaking ways in which data is being used to impact children’s health, including:

The intersection of healthcare data and genomic data
Using social media to understand the adverse side effects of certain prescription drugs
Predictive diagnosis of chronic illnesses in children

Key Takeaways

While healthcare is finally in the digital age, the degree to which healthcare data is collected and managed digitally is still evolving rapidly.
The digitization of healthcare through the adoption of electronic health records, plus the explosion in the volume of biomedical data generated in the course of research, means biomedicine is now extremely data-intensive.

How CHOP is using data

Building large, integrated datasets, extracting information from them to make predictions, and translating those predictions into the clinic.
Using social media to understand the adverse side effects of certain prescription drugs.
Predictive diagnosis: figuring out how to detect "frequent flyers," children who show up at the Emergency Department repeatedly because of chronic asthma.

Video Transcript

Moderator: It's my very, very large honor to introduce Jeff here, Jeff Pennington. Jeff is the Senior Director of Translational Informatics at CHOP, which is an amazing thing that's going on right here in our backyard. I talked with Jeff last week about some of the things that they're doing, and I think to basically anybody, it would seem pretty close to magic. And a lot of us in the industry are worried about ROI and what it means for our clients and what does this mean, why are we doing certain things. I'm very jealous of their mission, because they have the mission of helping children, families, and communities, right? So imagine if your ROI was to help sick kids or sick families in the area? It's a very inspiring message.

So I will pass it over to Jeff right now to just say a little bit more about his background and what he does and how his department now contributes to all those things.

Jeff: So thanks, Keith and Jason and everybody who prepared this event. This isn't the kind of thing I do all that often. I'm typically up in front of a bunch of PhDs and MDs talking about specific research projects, not more broadly the subject of technology in healthcare and biomedical research. But I'm in the Department of Biomedical and Health Informatics, which is essentially a 60 person entrepreneurial technology organization within the larger 15,000 person CHOP organization. And our mission is to impact child health using technology.

So biomedical research and healthcare are really finally digital. In 2014, the adoption of what's called Electronic Health Records surpassed 75% of providers. And prior to that, if you went to your doctor's office, you were probably dealing with a paper chart. And maybe sometime in the last few years, you noticed that your doctors started using a PC in the exam room. Maybe the paper chart was still there.

But what that means is that when it comes to patient care, there's an exponentially growing volume of data that are being collected just in the course of regular health care. And that means that healthcare finally is in the digital age. And I put a little asterisk on that, because there's still paper out there, and the degree to which our healthcare is managed and collected digitally is still changing and evolving pretty rapidly.

The other thing that happened in 2014 that's really important to technologists is that the cost of sequencing the entire human genome for an individual - like my genome, your genome - dropped below $1,000. So that means your entire genetic blueprint could be generated in a reasonable amount of time, weeks, for under $1,000. And the Human Genome Project, which was the public-private partnership that sequenced the first human genome, finishing in the year 2000, cost hundreds of millions of dollars, and took a decade. All right, so in just 10, 15 years, there was a crazy Moore's law decrease in the cost of generating these genomic blueprints.

And so, when you look at the digitization of healthcare through the adoption of Electronic Health Records...by the way, the only reason that happened is because the government passed a law that incentivized this adoption. So financial rewards to adopt after a certain point, financial penalties for those that didn't, okay? So that whole arc, starting in 2009, took about five years. Plus, this giant reduction in cost and corresponding explosion in the volume of biomedical data that can be generated in the course of research means that now, biomedicine is incredibly data-intensive.

And so, I was at a biotechnology company before I came to CHOP in 2008 where we were dealing with this, but in early, early stages. And prior to that, I did my dotcom tour of duty. I was at Ask Jeeves, dealing with the data produced by internet search. There are some gray hairs in the audience, so you might remember Ask Jeeves. And then prior to that, actually in a full on basement startup, moved into an exposed brick space, and did that. So I never really thought I'd end up in a hospital.

And I think the introduction was great. I'm really interested in your questions, trying to demystify CHOP as an institution that's encased in these big, shiny glass-fronted buildings in West Philly. So hopefully, you'll have a chance to chime in. Anyhow, so when you think about the fact that biomedicine is now digital, there's this substrate, all of this data that we have available to us. What do we do with it? That means that the concept of applied data science in biomedicine is here and now.

And so, actually taking algorithmic approaches to extract information from all of this new digital data is incredibly relevant right now. The most important hires that we are making are mathematicians, applied mathematicians, people who are used to dealing with the intersection of computer science, domain knowledge, and math. That's how we think about data science. And the goal through all of...oops...the goal through all of this is to actually, as I said, our mission is to impact child health using technology, is to translate what we can predict into the clinic.

And so, if we're able to build large, integrated datasets because of...again, healthcare's digital, finally. If we're able to actually extract information and make predictions using those data, then we're in a position to actually translate into the clinic. And I'll talk through three vignettes that cover these themes. But those are the larger themes.

Moderator: Hello. Before we get started, Jeff, can you talk about how HIPAA applies to the data and all these huge data sets that are coming in from the doctors and if you're on the processing end or analyzing end, what are the concerns for your department there?

Jeff: So that's a great question. HIPAA's kind of a buzz kill word in a lot of ways. It's something that people will drop when they want to shut the conversation down. "Well, HIPAA..." "Okay." What's wonderful about, again, our department is that we are on the research and development side of the house, where there are different rules for compliance and guidelines for doing the right thing. HIPAA covers operating clinical organizations, hospitals.

But when there's something called an Institutional Review Board involved, there's a risk versus reward calculation that's made. And when the rewards outweigh the risks, we are free to work with these data sets on a research and development basis. Let's say we develop some predictive application that then gets translated back into the clinic. Now that application is subject to the rules of HIPAA, which are really about protecting the patient and the institution from risk - privacy disclosures and those sorts of things.

Moderator: I think we'd like to see that first example.

Jeff: Don't read too much into the slide. The first example has to do with this intersection of digital healthcare data and genomic data. So something that I had to learn when I came to CHOP was that academic research in the ivory tower is a little bit of a myth. There are a lot of researchers who are driven by "publish or perish." They're very competitive. But there are many more researchers, especially in pediatrics, who are very interested in working together with their colleagues in very open ways.

And so, we've recently launched a partnership with a cloud services company called Seven Bridges, a cloud-based genomic data analysis platform that is meant to be wide open to anybody to show up and work. And so, the intent is that people with skill sets and expertise that aren't traditionally applied in healthcare could have data, plenty of data, that's open and freely available in the context of this cloud platform, and an analytics space to actually to do analysis and share results. And this is focused on children's brain tumors and it's a place where people are very interested in collaborating, because there are probably a dozen different high-level brain tumor diagnoses that a child could receive. But then when we actually dig into it, those tumors are very heterogeneous. There are new types being discovered constantly. And so, you'll never have enough of the same kind. Again, with this asymptotic...this increase in the differentiation and people's understanding of tumors...you'll never have enough of the same kind to do the kind of study that you want to do, so you have to share.

And so, this platform is a place where we've partnered with a private entity. They happen to have a contract with the National Cancer Institute for adult cancers. We are trying to deal with the fact that only about 5% of federal spending goes to pediatric cancers by investing in this cloud platform. There are a number of foundations, patient advocacy groups that are backing this initiative up and are helping to pay for it and make it happen. But the blue sky dream that we have is that somebody who's working on some sort of ecommerce optimization and has an algorithm for frequent item set data mining would show up and work in this environment and make some discovery about how certain genomic transcripts happen together, and that kind of discovery would be a contribution to the field. So again, public-private partnership or at least academic-private partnership, an intent to free the data, and have an open data environment for this kind of research, and expectation that people with skill sets that we hadn't really anticipated could come in and work with these data.
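As an illustration (not from the talk), the "frequent item set" idea Jeff mentions can be sketched in a few lines of Python. This is only the simplest case, counting pairs that co-occur across samples, not a full algorithm such as Apriori. The transcript labels (T1, T2, ...) and the 0.5 support threshold are made up for the example and do not come from any CHOP dataset.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(samples, min_support=0.5):
    """Return {pair: support} for item pairs seen together in at least
    min_support (a fraction between 0 and 1) of the samples."""
    if not samples:
        return {}
    pair_counts = Counter()
    for items in samples:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1
    total = len(samples)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical samples: each set lists the transcripts detected in one tumor.
samples = [
    {"T1", "T2", "T3"},
    {"T1", "T2"},
    {"T2", "T3"},
    {"T1", "T2", "T4"},
]
pairs = frequent_pairs(samples, min_support=0.5)
```

In an ecommerce setting the same function would take shopping baskets instead of tumor samples, which is the kind of skill transfer the platform hopes for.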

Moderator: I'm going to use a buzzword. Does your department basically "dematerialize" the experiential phase of stuff? So instead of having to go collect data, people are able to use your giant store of data to find their own conclusions in there.

Jeff: Right, that's a big part of why this kind of sharing is important. We want to take the costs of acquisition, both in terms of dollars up front, and all of the regulatory patient privacy consent, the overhead, really, that's required to accrue these kinds of data resources, take that on, on behalf of a community that could then use it. So I'm not familiar with dematerialize. It's a new buzzword to me. But I think maybe that's what we're talking about.

Moderator: Yes. Definitely. I'm going to open it up. Does anybody have any questions for Jeff before we move on to number two?

Man 1: Do you feel we're at a point where the data and the tools are to the point where they can actually take advantage of all the data that's out there? Or is that something that's still developing, and how long do you think that will take where you can actually get strong enough software to really make a difference?

Jeff: It depends. I'll say that the fact that the cost of compute's so low, and the fact that there are more and more people trying to create algorithms and then, software implementations of algorithms, to analyze in a less structured, non-linear way large datasets, means that we're getting closer and closer. We are now at the point...and again, I have a skewed view on this, because my team and my organization's job is to be in that zone of uncertainty where there isn't a clear way to do something. We're figuring it out as we go. And that's where maybe we'll pilot something using R, which is a statistical analysis package. But that'll probably break down pretty quickly, and we'll start writing Python, using some of the analytic libraries that are available in Python.

There's a new grant program from the NIH to actually create software, recognizing that there need to be more mature methods. So the federal government's devoting, at this point, about $100 million a year to software development in this context. We were in a meeting today talking about, "Okay, we're basically pitching the NIH to fund us to develop a new software method." So that's a bit of a roundabout answer to your question. But I'm more hopeful now that we'll end up with some more stable software methodologies or software implementations for some of these things. And part of my job is to bring technologists into healthcare and do exactly that.

So right now, we have about 45 or 50 projects in our GitHub repository. And a big part of our job is to do that, it's to package it up and open source it and move it on to the next phase, which might be somebody else using it and contributing, and then maybe even somebody picking something up and commercializing it. That actually happened this year. A company out in Montana called Golden Helix took one of our open source projects, which we licensed under the BSD license, which pretty much says, "Just don't sue us, and do what you want," and they built a commercial platform, are marketing it, and I think that's great. Hurts a little bit to see it out there. But it wouldn't have gone anywhere without that.

Man 2: Yeah, just interested to hear how the organization has changed or planned for this move into big data? Thinking 5 to 10 years ago, it was a medical facility with medical practitioners but then, building out a whole capability of data management, from servers and the processes required to be able to translate that information into something usable. How have you seen the organization shift to be able to accept and roll out this move into big data, and to be able to use that data properly?

Jeff: Yeah, that was an early prescient decision made in 2006 to make a pretty big capital investment in compute capacity at CHOP. So there was a contract negotiated with...I'm gonna blank on the name of the data center in Norristown or Bethlehem. There was an investment made and a team adoption of VMware to virtualize, adoption of EMC for network attached storage, and investment in a high-performance compute cluster. And what that did was allow everybody who'd been working on a server under their desk to get out of that server and move into an environment where they could find the right scale for the job that they were doing. I showed up at CHOP in '07, and at the biotech company I was at before that, if we wanted to start a new project, the hardest question was "Okay, where do we get space? Where do we get a slice of some existing utility?" And when I showed up at CHOP, it was, I put a ticket in and I got a virtual machine. That was just fantastic.

Now, I'll say we were early on that, and that worked out really well. But now we've fallen behind, because I'd say three years ago, we should have been taking a cloud first approach, and we should have stopped with large capital acquisition. We should have started figuring out the business case to pay as we go in Amazon or other cloud service providers. So we're really just trying to figure out how to restart that. This cloud data-sharing project is one where we just went ahead and did this, independent of the institution and the institution resources. And now, the institution's learning from that. And so, we'll catch back up.

Moderator: When I spoke to Jeff the other day, he told me about a really interesting use they have for the open source data from Twitter, which I think Twitter right now is trying to prove its value, and this is the most valuable thing I've ever heard Twitter be used for. So I will let Jeff explain what they're doing with that data.

Jeff: So moving from big data acquisition and management to more of a data science application. So we have, in the pediatric population, a really, really high rate of prescription of drugs to intervene on behavioral conditions: ADHD, ADD. And there's a lot of evidence out that there's overprescribing. Kids are getting medicated at the drop of a hat. There's a lot of popular press about that. And so, these drugs were approved for use under certain conditions that maybe we're drifting away from, especially in the pediatric population. And so, we're trying to develop an understanding based on our patients' experience of the adverse events, the side effects that occur from the use of these drugs, using social media.

So the traditional way to approach this would have been to figure out how to contact 100, 1000, 5000 patients and interview them and they'd fill out a questionnaire. What we're doing is sampling the Twitter-sphere, the stream of tweets, at a pretty reduced rate to collect an actually representative sample of tweets. We use natural language processing methods to extract the report of adverse events like headache, nausea, some sort of sleep disruption, from tweets. And our interest is in correlating the rates of those side effects, like how often people report them, to what's been discovered or described under more formal conditions. And so, there's a team of people who hooked up to the Twitter API, dumped, are still dumping thousands of tweets into MongoDB, which is a NoSQL database that's good for this kind of thing, and unfortunately, still having to do the more traditional, gold standard labeling process of identifying all of the different ways that somebody could misspell Adderall, all of the different Urban Dictionary lookups that are necessary to find the slang or short names for these drugs, and then actually classifying a sample of tweets as being truly adverse event reports or side effect reports versus not.

And you mentioned that this might be one of the most useful things that Twitter's been used for. This kind of work is happening in a lot of different conditions, by a number of different researchers around the country. What we're trying to do is take something that's very specific to the pediatric population and do the work there. One thing we've learned is that it's really amazing what people will tweet about. And unfortunately, these drugs are used heavily not for their behavioral conditions.
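As an illustration (not from the talk), the matching step Jeff describes, finding tweets that mention a drug under any of its spellings and then looking for side-effect terms, might be sketched as below. The word lists are tiny, hypothetical stand-ins for the lexicons his team builds by hand, and plain keyword matching is swapped in for their natural language processing methods. It cannot tell whether a tweet is a genuine adverse event report, which is why the talk mentions manually classifying a labeled sample.

```python
import re
from collections import Counter

# Hypothetical lexicons: canonical name -> spellings and slang seen in tweets.
DRUG_VARIANTS = {
    "adderall": ["adderall", "aderall", "adderal", "addy"],
}
SYMPTOM_TERMS = {
    "headache": ["headache", "head hurts"],
    "nausea": ["nausea", "nauseous"],
    "sleep disruption": ["insomnia", "can't sleep", "cant sleep"],
}


def _compile(variants):
    """Build a case-insensitive, whole-word pattern matching any variant."""
    body = "|".join(re.escape(v) for v in variants)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


DRUG_PATTERNS = {name: _compile(v) for name, v in DRUG_VARIANTS.items()}
SYMPTOM_PATTERNS = {name: _compile(v) for name, v in SYMPTOM_TERMS.items()}


def extract_mentions(text):
    """Return (drugs, symptoms) mentioned in one tweet, as canonical names."""
    drugs = {d for d, p in DRUG_PATTERNS.items() if p.search(text)}
    symptoms = {s for s, p in SYMPTOM_PATTERNS.items() if p.search(text)}
    return drugs, symptoms


def symptom_rates(tweets):
    """Fraction of drug-mentioning tweets that also mention each symptom."""
    drug_tweets = 0
    counts = Counter()
    for text in tweets:
        drugs, symptoms = extract_mentions(text)
        if not drugs:
            continue  # Ignore tweets that never mention a tracked drug.
        drug_tweets += 1
        counts.update(symptoms)
    if drug_tweets == 0:
        return {}
    return {s: c / drug_tweets for s, c in counts.items()}


# Made-up tweets for demonstration only.
tweets = [
    "Took my addy and now I can't sleep",
    "Aderall gives me a headache",
    "nice weather today",
    "adderall day, feeling fine",
]
rates = symptom_rates(tweets)
```

Rates computed this way are the kind of aggregate figure that could then be compared with side-effect rates reported under more formal conditions.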

Moderator: Do the people that you're making recommendations to, based on the conclusions here, do they mind that it wasn't clinically collected? Or are they okay with your methods coming from social?

Jeff: So we've only done an anecdotal exploration of that with patients. So the clinicians that we worked with on this project are pediatricians, so they see patients all day, every day. And so they've talked to some of their families about this, and the families are okay with it. What's interesting about this is this population health. We're trying to generate these incidence rates at an aggregate level across a broad population.

And so, we're not going back to the @so-and-so's Twitter account and saying, "Hey, saw you reported an adverse event. You might want to consider this adjustment to your medication." It's not that kind of individual level interaction. But again, it's an example of using social media as a method for patient report of these kinds of conditions. And this kind of patient engagement is something that we're all very excited about, because people are sharing more and more, and people are more and more interested in having some more meaningful interaction with their healthcare providers than just an annual physical or seeing them when they're sick.

Moderator: Let's talk about the predictive aspect of what you do, and the applications that you've been working on to help...

Jeff: Actually do something with all this, yeah.

Moderator: Exactly.

Jeff: So the final vignette, or the example I'll talk about, is in something that's all about Philly. So about 25% of the kids in Philly are affected by asthma or an asthma-like condition, which is just obscene. So one in four kids in the city and county of Philadelphia are walking around with asthma and maybe an inhaler or maybe they should have an inhaler. And about 20% of the kids that come into the CHOP ED...I'm sorry, 20% of the Emergency Department visits at CHOP are from about 5 or 6% of the patients, and the primary complaint is respiratory distress.

And what we found is that looking back at the electronic record of all this care, what we found is that these kids are coming in to the Emergency Department to receive what should be chronic care that they get from their pediatrician. And there are many different reasons for why they're using the Emergency Department as their primary care provider. But what we're doing right now is figuring out how to detect, based on the population of all of the Emergency Department visits that happen, how to detect or to look for what we call "frequent flyers," children that show up on a repeated basis. And we're trying to tease apart what the most significant pattern would be that would identify a child that's having chronic asthma, and they're showing up in the ED when they're having an acute attack. And the intent is that when we can detect that, effectively flag this child, we can instead of saying, "You really should follow up with your pediatrician. You really should go to your primary care doc," which they probably haven't been doing, we can prescribe them a controller medication, medication to treat their chronic condition, and hopefully then reduce the rate at which these frequent flyers show up in the Emergency Department.

So that's really important for the kids, because these acute episodes are incredibly disruptive to their lives, their health, and are associated with all sorts of bad outcomes. And it's important from a health system's perspective, because that represents...the most expensive interaction you could have with the healthcare system is in the Emergency Department, absent some crazy surgical procedures in terms of the hours that are spent, and the cost of the services of the Emergency Department is way up.
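As an illustration (not from the talk), the simplest version of a "frequent flyer" flag is a count of respiratory Emergency Department visits inside a rolling time window. The talk says the team is still working out which pattern is most significant, so the rule below (three or more visits within 365 days), the record layout, and the complaint label are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def flag_frequent_flyers(visits, min_visits=3, window_days=365):
    """Return the IDs of patients with at least min_visits respiratory ED
    visits inside any window of window_days.

    visits is an iterable of (patient_id, visit_date, complaint) tuples,
    a hypothetical layout chosen for this sketch.
    """
    by_patient = defaultdict(list)
    for patient_id, visit_date, complaint in visits:
        if complaint == "respiratory distress":
            by_patient[patient_id].append(visit_date)

    window = timedelta(days=window_days)
    flagged = set()
    for patient_id, dates in by_patient.items():
        dates.sort()
        start = 0
        for end, current in enumerate(dates):
            # Advance the left edge until the window spans <= window_days.
            while current - dates[start] > window:
                start += 1
            if end - start + 1 >= min_visits:
                flagged.add(patient_id)
                break
    return flagged


# Made-up visit records for demonstration only.
visits = [
    ("a", date(2015, 1, 10), "respiratory distress"),
    ("a", date(2015, 3, 2), "respiratory distress"),
    ("a", date(2015, 6, 20), "respiratory distress"),
    ("b", date(2013, 1, 1), "respiratory distress"),
    ("b", date(2014, 6, 1), "respiratory distress"),
    ("b", date(2015, 12, 1), "respiratory distress"),
    ("c", date(2015, 2, 2), "fracture"),
]
flagged = flag_frequent_flyers(visits)
```

Patient "a" has three respiratory visits within one year and is flagged, while patient "b" has the same number of visits spread over nearly three years and is not. A real model would likely weigh many more signals from the electronic record than visit counts alone.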

Moderator: Okay, guys. Well, unfortunately, we are out of time. I do want to just give another huge thank you to Jeff for being here. And Jeff, if you had to name three specialties that you're looking for, three fields that you're looking for help from, three specialties, what would you say?

Jeff: Yeah. DevOps, anybody who's doing applied math, and really data savvy software developers.

Moderator: Awesome. Thanks again.