Reproducible Data Analysis in Pathology and Lab Medicine (Stephan Kadauke, Amrom Obstfeld)


Okay everyone, it’s great to be here.
Just a quick introduction, since some of you may not know who we are.
This is Stephan Kadauke. He is an assistant lab director over at CHOP, also a
member of the Division of Pathology Informatics, also at CHOP, and is the
course director of the course that we’ll be describing to you, Reproducible
Clinical Data Analysis with R and RStudio. He’s also a certified instructor,
actually, in R. And for those who don’t know me, I’m also a director over at
CHOP, director of the Pathology Informatics division, and a director with Stephan.
So what we’re going to be talking to you about right now is how we all analyze
data and how the analytical tools that we choose to use as we analyze our data
impact reproducibility and, really, the overall validity of the analysis. When I
use the term reproducibility in this context, what I mean is the ease with which
another individual can take the raw data that you started with and recreate
the exact same statistics, summaries, and plots, and hopefully come to the same
conclusions, which is really what we’d like to have from our analyses.

So let me illustrate the importance of this concept of reproducibility using a
couple of real examples of what happens when analyses are not done with this
concept in mind. The first example comes from a scandal involving two
scientists from Duke University, Anil Potti and Joseph Nevins. They claimed
that they had identified gene signatures in cancer cell lines that predicted
patient response to therapy, and they published this finding in all the
greatest journals that we have. They essentially had that holy grail of
personalized medicine within their grasp, or at least that was the claim.
However, they had made several sloppy errors in their Excel worksheets,
which ultimately led to an attempted cover-up, all of which was unraveled in
public by a couple of biostatisticians who were trying to reproduce their
results. The outcome was that cancer patients who had enrolled in a clinical
trial based on these gene signatures almost received the wrong therapy, and
many of these very high-impact articles had to be retracted. But this is not
something that only impacts healthcare. It really is something that extends to
all fields. This one comes from the field of economics. Reinhart and Rogoff
are two economists who still work at Harvard. They published a very
controversial paper. The claim was that they had identified a relationship
between high national debt levels and reduced economic growth. This was coming
out during the Great Recession and was used as evidence by conservative
policymakers, really worldwide, who believed that reducing government spending
would actually be beneficial in terms of getting their countries out of
recession. So that was their claim. However, it turns out that there was an
error in their worksheet, which was uncovered by, actually, a grad student at
the University of Massachusetts, and he only found it when he insisted on
getting the raw data from Reinhart and Rogoff. As it turns out, the conclusion
wasn’t quite accurate. There is a negative relationship between national debt
and economic growth, but it doesn’t actually lead to contraction, as Reinhart
and Rogoff had claimed. The outcome of this error really was that austerity,
the policy known as austerity, was used as a kind of tool to drag these
economies out of recession, perhaps unnecessarily.

So, you know, these
are sort of well-known, publicized examples, but the question that
I’d like to ask, somewhat controversially, is: are we really immune
from these kinds of issues? I’d say that the answer is probably no. Who
hasn’t been in a position where they’ve been doing an analysis and they’ve asked
themselves: Where did I get these data from? How did I pull these data? If I
wanted to get them again, would I be able to do it in the exact same way? How
did I create this plot? Exactly what were the steps that I took so that I came
to this conclusion? How did I process the data? Right, data often needs to be
kind of moved around a little bit before it’s ready to be analyzed. And why
did I decide to omit some of these outliers? What about them led to
that, and am I doing it in a rigorous manner?

Why does this happen? If you think
of this as being kind of a prototypical data analysis project, it’s typically
gonna look something like this. You’re gonna begin by defining your
goals, the objectives of the project, and identify what data you’re gonna need
to meet these goals. You’re gonna bring the data into your analytic software in
some way, and you’re gonna tidy that data, by which I mean you’re going to
reshape it, you’re going to clean it, and get it into the right format so you
can do your data analysis. At that point you’re ready to really begin to
understand what’s in the data. You’re going to try to extract that information,
and that usually goes through some kind of iterative cycle involving
transforming the data, making plots, and modeling the data. And then the final
piece is communicating your conclusions to colleagues.

So generally this process occurs as a
collaboration. Often there’s an attending who’s supervising a trainee, or there
might be a PI who’s supervising a student. The domain expert, the supervisor,
plays a dominant role in defining what the project is about and gathering the
data set that needs to be analyzed, and then handing that off to the individual
who will be doing the analysis, the student or the trainee. So this might be a
lab director, for instance, who’s providing a list of patients, maybe, who have
had some kind of error, a transfusion reaction, and then that student is going
to engage with the data, with intermittent communication back with the
supervisor.

When Excel is used as the medium for this process, the workflow will involve
points at which files need to be shared, and what actually is going on within
the context of the analysis kind of becomes a black box. So what we’re
essentially doing is effectively creating silos of data analysis, and what’s
going on on either side of those walls becomes a black box. That’s what we
want to avoid.

What we’d like to propose is adopting a new workflow in projects that involve
data analysis. This would be a more reproducible workflow, which will allow
collaborators to analyze projects more effectively, more closely, and in a more
transparent way by removing some of the barriers that are imposed by using
Excel. This workflow is really made possible by eliminating spreadsheets as
much as possible and adopting a new tool known as computational documents,
which Stephan is going to describe in more
detail.

So a key technology for reproducible data analysis, or data science, is
called the computational document. What’s a computational document? A
computational document is simply a document that has executable code inside
of it. Before I show you what a computational document could look like and how
it fits into our reproducible workflow, let me introduce a few terms I’ll be
talking about. So the first is R. Where’s my pointer here? The first is R.
R is a programming language for data analysis, and we use R for lots of
reasons, including that it’s free, it’s great for wrangling data, and it
creates great graphics. Also, you can use R to pull data from any of the
databases that are being used at Penn and CHOP, such as Epic, Cerner, and Soft.
It’s really just a matter of the permissions that need to be granted by the
respective database administrators, and you can pull your data. Finally, and
I’ll be showing some data to support this statement, doing basic data analysis
with R is actually not that difficult to learn, even if you’re not a
programmer. So
then there is R Markdown. R Markdown is a computational document format that
has executable code in it that is usually written in R. I say usually
because it’s actually possible to write code in Python, C++, and a bunch of
other languages inside of an R Markdown document, but by default we use R in
R Markdown documents as executable code.

Finally, there is RStudio. RStudio is the name of a company and also the name
of a piece of software that this company makes and makes available for free.
You can think of RStudio as a fancy editor for writing R Markdown, and you
can run RStudio on a Mac or on Windows, or, if you’re concerned about
protecting patient data or intellectual property, it’s fairly straightforward
to lock down an RStudio Server behind the hospital firewall. So I’m going to do a
quick demo. Yes, okay. So this is the RStudio editor, and in our reproducible
workflow this is where we author our computational documents, which are
written in R Markdown. You can write narrative to explain what you’re
doing, which is what I’ve done here, and you can also write some code, so
that’s what I’m going to do right now. I’m gonna write a line of code, written
in R, to create a scatter plot of a sample data set, and I can actually run
this code and see what the result is. So here I see the scatter plot of
distance versus speed of some data set about cars.

So let’s say that you’ve been working on an analysis and are now ready to
share the preliminary analysis and results with a collaborator by email, or
you just want to look at it together on a big screen like we’re doing right
now. This is when you knit your R Markdown document. So I click on Knit, Knit
to HTML, and what you get is this nicely formatted HTML document that you can
view in any web browser and email to collaborators. This is really nice for
prototyping analyses and looking at them together. You can also knit to PDF.
PDF you might want to use for a more polished look and for printing, for
example for a manuscript draft or an audit you want to present at a QI
meeting. Or you can knit to Word. So here I have a Word document that was just
created from this document, and it contains exactly the same information.

So we use R Markdown as the fundamental substrate of
our reproducible data analysis workflow. You can imagine a PI and a grad
student, or a lab director and a technologist, collaborating on this R Markdown
document that contains all of the code to load the initial data, to make the
necessary data transformations, and to create summary tables and graphics. As
I’ve just shown, you can knit R Markdown documents into a bunch of
different formats like HTML, PDF, and Word. You can also knit PowerPoint
presentations and even interactive dashboards, and at CHOP we’ve created a
number of dashboards that are either written entirely in R Markdown or
were prototyped in it. So here’s a dashboard from the Division of Genomic
Diagnostics, written by Marisol Mahdi’s group. This dashboard displays test
volumes and turnaround times for the various molecular tests we do. And here’s
a dashboard that allows performing a virtual crossmatch for solid organ
transplant patients.

I think it’s worth pointing out that one area where R Markdown-powered
dashboards can really shine is when a lab information system provides 90% of
the necessary functionality and you need to custom-build the remaining 10%.
Here, knowing some basic R Markdown can be really powerful for lab leaders,
because then you can take much more active ownership and be actively involved
in developing and iterating on these tools.

So let’s say that you’re a busy clinician or researcher and you know
your way around Excel. How can you get started with R Markdown? This is
something that I’ve been thinking about a lot for the past three years, and the
outcome is the course I’ll be talking about in the last few minutes of the
talk. The scope of this course is that we want participants to pick up the
skills necessary for collaborating on a computational document written in R
Markdown. So the goals of the course are to appreciate the meaning of
reproducibility as it relates to data analysis and to learn a practical way to
analyze clinical data reproducibly. By the end of the course, participants
will be able to define reproducibility, explain why it’s important, and use R
and RStudio to import, transform, and visualize data and create a
reproducible report, which is just another word for a computational document,
in R Markdown.

We streamlined the content so that the entire course fits into a one-day
workshop that covers getting data into R, exploring that data graphically,
and writing an R Markdown document. We use active learning techniques such
as concept mapping, think-pair-share, and timed interactive exercises in which
course participants practice coding in R Markdown, and together these in-class
activities get participants to a point where they can follow along with the
code of a reproducible report that’s written in R Markdown.

So we also invited our
participants to complete an optional course project in the form of an R
Markdown document that addresses some clinical or research question that
they’re personally or professionally interested in, and I want to highlight
three here. The first one is from Lisa Jiang, a pathology resident who had no
prior coding experience, who was able to construct a machine learning model to
better triage peripheral blood flow cytometry specimens. She presented
her work at escape in a podium presentation, and her manuscript is
in press at the American Journal of Clinical Pathology. David Dye is an MD-PhD
student who was able to put together a machine learning model that uses
features of primary Lewy body neuropathology to predict Alzheimer’s
disease. And Jeff Lee, another student, wrote an app in R Markdown for
interactive exploration of RNA-seq data.

So far we’ve taught eight iterations of the course, three of them for the Penn
pathology department. The Penn pathology cohort consisted of 38 pathology
residents, fellows, faculty, and a few others, including supervisors or
analysts. You can see that most of the participants were trainees, but we had
a lot of participation from faculty as well, including some senior faculty
members.
Before the workshop, we surveyed participants about their confidence in
their programming skills, and a lot of people indicated that they felt they
didn’t know enough about programming to really benefit from the course, so
those folks over here. But when surveyed after the course, no one felt that the
required knowledge about programming was too high for them to follow or
benefit. And while I’d like to think that it’s because of my superior teaching
ability, it’s more likely because this material just isn’t really that
difficult to learn or understand when presented in the correct form.
Overall, participants felt that the course achieved its stated objectives.
People felt more confident in their ability to load data sets into R, to
transform data, and to create graphical plots. And on a scale from unsatisfying
to outstanding, where 10 is outstanding, the workshop received an average
rating of 9.6.

To sum up: Excel poses a significant risk to data quality, which can damage
reputations and harm patients. A reproducible data analysis workflow can
improve data quality and accelerate development of data products.
Computational documents are a key technology for reproducible data
analysis, and we’ve developed a curriculum to teach reproducible data
analysis to pathology trainees, faculty, and staff. We’ve also taught this
material at last year’s Pathology Informatics Summit, and we’ve been
invited to do it again in Pittsburgh next door.
We’re also developing an extended version of this course as part of a
new rigor and reproducibility module for the Penn MD-PhD students. So you can
see that the overarching goal here is to create a community of researchers and
clinicians who are conversant in reproducible data analysis, not only
inside the pathology department but throughout the medical school and the
hospitals.
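
For readers who have not seen an R Markdown document, the demo described in the talk (a line of R that plots distance versus speed for the cars data, followed by knitting to HTML) might look roughly like the sketch below. This is an illustration under stated assumptions, not the speakers' actual demo file: the file name `cars-report.Rmd`, the report title, the axis labels, and the chunk label `scatter` are all hypothetical. The only non-base call is `rmarkdown::render()`, and the rendering step is skipped when the rmarkdown package or pandoc is not installed.

```r
# Illustrative sketch only: file name, title, and chunk label are hypothetical,
# not taken from the speakers' demo or course materials.

# The kind of one-line plot shown in the demo: stopping distance versus speed
# from R's built-in `cars` data set (columns `speed` and `dist`).
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# Source of a minimal R Markdown document: a YAML header, some narrative,
# and one executable R chunk. The fence is built with strrep() only so that
# this example can itself sit inside a fenced code block.
fence <- strrep("`", 3)
rmd_lines <- c(
  "---",
  "title: \"Stopping distance versus speed\"",
  "output: html_document",
  "---",
  "",
  "This report plots stopping distance against speed for the cars data.",
  "",
  paste0(fence, "{r scatter}"),
  "plot(dist ~ speed, data = cars)",
  fence
)

rmd_path <- file.path(tempdir(), "cars-report.Rmd")
writeLines(rmd_lines, rmd_path)

# "Knitting" from code rather than the RStudio button. Rendering needs the
# rmarkdown package and pandoc, so it is skipped when either is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  html_path <- rmarkdown::render(rmd_path, output_format = "html_document",
                                 quiet = TRUE)
  # Other formats: output_format = "word_document" or "pdf_document"
  # (PDF additionally requires a LaTeX installation).
}
```

Because the narrative, the code, and the output all come from one source file, a collaborator who receives the `.Rmd` can rerun every step, which is the property the talk calls reproducibility.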
