Nightingale open science 40tb
4/10/2023

Like any scientific field, medicine needs data to grow and thrive. Recent successes from other disciplines - genomics, computational biology, language modeling, and image recognition, to name a few - suggest that datasets must possess two specific features. First, they must be open access: they cannot be monopolized by those who produce them, whether academics, non-profits or corporations. Instead, the data must be accessible at low cost, in terms of both money and time. Only then can good ideas thrive, on a level and just playing field. Second, the data must be curated around 'common tasks': important, field-defining problems on which a community of researchers can collaborate, compete and improve. Datasets meeting these two criteria are the 'secret sauce' of machine learning - more than just computing power, or individual genius - and underlie the unprecedented recent progress in translation, sentiment analysis, object and facial recognition, and other tasks 8 (Table 1).

Existing health datasets seldom meet these two criteria. Instead, they are often controlled by a handful of researchers at well-resourced institutions or companies. Access for everyone else is laborious, costly, time-consuming or just impossible, despite the fact that the creation of nearly all health data, whether from insurance premiums or research grants, is publicly funded. This has a variety of negative consequences. Algorithms are designed largely to serve the needs of the privileged 9. Their performance cannot be adequately scrutinized, leading to failures of replication and erosion of trust 10. Highly talented researchers who could make major contributions to medicine are diverted into solving trivial problems in other fields.

A commonly cited reason for these barriers to access is the protection of patient privacy. But given the many technical solutions to this problem, from sophisticated deidentification methods to highly secure cloud environments, this cannot be the only reason. Open data are a classic public good: market forces do not favor their creation. While they have enormous benefit to everyone in the long run - patients, health systems and industry - no single actor has a strong incentive to act (for a thoughtful review, see ref.).

Second, health datasets are seldom curated in a way that allows researchers to meaningfully engage with critical questions. Specifically, they are not labeled with the ground-truth patient outcomes that are necessary for researchers to solve non-trivial problems. Many health datasets available today implicitly treat human opinion as ground truth: an ECG is labeled with a cardiologist's judgment of arrhythmia, an X-ray is labeled with a radiologist's judgment on the severity of arthritis. While human labels are useful for efforts to automate human judgment, such efforts will also automate human biases and errors 12, 13. And ultimately, this approach is highly limiting: we want algorithms to do better than humans, not just produce the same results. To do so, we need algorithms that learn from nature - patient experiences and health outcomes - not physician judgment.

The task of creating ground-truth labels is not easy. Consider the task of labeling a biopsy image. It would be useful to know whether a patient ultimately progressed to metastatic cancer. But doing so, even when comprehensive electronic health records are available, requires a great deal of specialized knowledge: about cancer and where it metastasizes, how that event is recorded in the course of usual care, and how structural biases in health care affect when and how data are recorded. This places a major burden on individual researchers, particularly those without deep medical domain expertise. More problematic still, ground-truth labeling is often infeasible in existing datasets: it can require dedicated efforts to link health system data to external sources of truth, for example, cancer registries or death records. Many health systems in the USA only record a patient's death if it happens within the four walls of the hospital - a problem given that only one-third of deaths in the USA occur in hospital. Linkages, for example to Social Security data in the USA or government registries elsewhere, can be essential but are neglected in many current datasets of health records.

First, medical images are rich sources of signal about patient health - so rich that doctors are unlikely to make full use of all the information contained within them. By contrast, most electronic health record data (for example, diagnoses, procedures and text-based notes) are directly produced by doctors, who are necessarily aware of the information they contain.
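The registry-linkage problem behind ground-truth labeling, discussed above, can be made concrete with a toy sketch. Everything here is invented for illustration - the field names, the shared patient identifier, and the two-year outcome horizon are assumptions, not part of any real dataset schema - but it shows why in-hospital records alone understate outcomes and why a merge with an external source of truth changes the labels:

```python
# Hypothetical sketch: building a ground-truth outcome label by linking
# hospital records to an external death registry. All names are invented.
from datetime import date

# In-hospital data alone miss most deaths (roughly two-thirds of US deaths
# occur outside the hospital), so we also consult an external registry
# keyed on an assumed shared patient identifier.
hospital_records = [
    {"patient_id": 1, "biopsy_date": date(2019, 3, 1), "died_in_hospital": False},
    {"patient_id": 2, "biopsy_date": date(2019, 6, 1), "died_in_hospital": True},
    {"patient_id": 3, "biopsy_date": date(2020, 1, 15), "died_in_hospital": False},
]
# patient_id -> date of death recorded in any setting (external registry)
death_registry = {3: date(2021, 5, 2)}

def outcome_label(record, registry, horizon_days=730):
    """1 if the patient died within `horizon_days` of the biopsy, else 0."""
    if record["died_in_hospital"]:
        return 1
    death_date = registry.get(record["patient_id"])
    if death_date is not None:
        return int((death_date - record["biopsy_date"]).days <= horizon_days)
    return 0  # no evidence of death within the horizon

labels = {r["patient_id"]: outcome_label(r, death_registry) for r in hospital_records}
print(labels)  # patient 3 is only labeled correctly because of the linkage
```

Without the registry merge, patient 3 would be labeled event-free; the linkage flips that label, which is exactly the kind of correction the article argues existing datasets neglect.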