
The Importance of Data Curation & The Do's & Don'ts of Using LLMs

16.07.24

In our previous article, we emphasized the importance of establishing clear and interpretable data standards for all organizational data assets. Once established, these standards facilitate the direct generation of data in the agreed-upon format. However, there are many situations where you have limited control over how data is created. Some reasons for this include:

  1. Historical data: Created long before your data standards were established.
  2. Partner’s data (e.g., CRO): Created using their own guidelines, which may not align with yours.
  3. Public data: Highly customized datasets with varying formats.
  4. Internal process variability: Manual data generation can lead to ignored guidelines, errors, or typos.

In such cases, data must be transformed to match the required guidelines and templates. This process is typically called curation or harmonization.

We will primarily focus on the harmonization of experimental attributes (e.g., experimental design, patient phenotypes, and sample descriptions) rather than measurement data (e.g., lab device measurements or omics data), as the latter can often be handled using computational approaches at scale.

Scenarios for harmonizing experimental descriptions may include:

  1. Convert free text: Map values to a selected vocabulary or ontology, e.g., mapping 'heart attack' to 'myocardial infarction' using a medical ontology.
  2. Convert numeric values: Change from one unit scale to another, e.g., converting patient ages from years to days for a pediatric study.
  3. Split complex descriptions: Break a value down into several attributes, e.g., separating 'blood pressure reading' into 'systolic' and 'diastolic' values.
  4. Extract attributes: Derive attributes from larger text segments, e.g., identifying and extracting 'smoking status' from patient medical histories.
  5. Handle missing values: Address and rectify gaps in data, e.g., imputing missing values using the mean or median of existing data.
  6. Check for anomalies: Identify and correct inconsistencies, e.g., flagging and reviewing outliers in patient weight measurements.
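Several of the scenarios above can be expressed as small, reusable functions. The following is a minimal sketch in Python; the vocabulary mapping and the conversion factor are toy examples, not a real ontology or standard:

```python
# Scenario 1: map free-text values to a controlled vocabulary.
# This toy dictionary stands in for a real medical ontology lookup.
VOCABULARY = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
}

def normalize_term(value: str) -> str:
    """Return the preferred term for a free-text value, or the value unchanged."""
    return VOCABULARY.get(value.strip().lower(), value)

# Scenario 2: convert a numeric value from one unit scale to another.
def years_to_days(age_years: float) -> float:
    """Convert an age in years to days, assuming an average year of 365.25 days."""
    return age_years * 365.25
```

In practice, the lookup table would be replaced by queries against the actual ontology your data standards prescribe.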

Until the recent advances in LLM capabilities, curation processes typically fell into one of two categories: manual or automated.

Traditional Curation Process


The most straightforward curation approach involves a dedicated expert (or a team of experts) well-versed in the guidelines and the experimental data. With proper training and domain knowledge, these professionals can make informed decisions and interpretations of the original experimental descriptions to align with the required guidelines. Human curators excel in cases requiring natural language understanding, such as dealing with synonyms or typos, and are invaluable for transforming large unstructured text segments into formatted sample attribute tables.

If you have a large volume of data that already follows an established format, you can create a straightforward set of rules for automatic transformation. For example, if you receive data from a partner or CRO that follows their own guidelines (which may differ from yours, or which were deemed adequate at the time the data was produced), you need to understand both their guidelines and the actual values used. Initially, this requires manual curation to identify and define the rules, but once established, the rules can be executed automatically. For instance:

Rule 1: Change attribute name "Organism" to "Species".
Rule 2: In "Species" attribute, convert value "Human" to "H. sapiens".
Rule 3: …

Such rules become more sophisticated when specific vocabularies or ontologies (and the corresponding value IDs) or large-scale formatting transformations (e.g., converting age in years to age in days) are involved. Once established, however, such a rule file can be reused as many times as needed, hopefully with only occasional amendments.
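A rule set like this can be kept as a small, version-controlled script. Below is a minimal sketch, assuming each sample record is represented as a plain dictionary; the rule contents mirror the two example rules above:

```python
# Rule 1: rename attributes (old name -> new name).
RENAME_RULES = {"Organism": "Species"}

# Rule 2: rewrite specific values, keyed by (attribute, original value).
VALUE_RULES = {("Species", "Human"): "H. sapiens"}

def apply_rules(record: dict) -> dict:
    """Apply the rename rules, then the value rules, to a single sample record."""
    renamed = {RENAME_RULES.get(key, key): value for key, value in record.items()}
    return {key: VALUE_RULES.get((key, value), value) for key, value in renamed.items()}
```

The same two dictionaries can then be extended rule by rule as new corner cases are discovered, without changing the transformation code itself.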

If your data is highly customized and historically generated by different individuals and teams, the number of potential rules increases significantly. You need to familiarize yourself with all the datasets to understand their contents and identify corner cases. Manual curation is appropriate here, but the main limitation is the curators' capacity to handle large amounts of data. Typically, curation is not a full-time job but an additional task someone must complete.

The table below summarizes the traditional approaches most suitable for typical curation cases.

Table 1: Traditional approaches to typical curation cases.

| Scenario | Manual Curation | Automated Curation |
| --- | --- | --- |
| Convert free text | Better when: understanding context and handling ambiguities is crucial, e.g., for large volumes of custom data. | Better when: a few known changes are needed for large datasets. Fast, consistent, scalable. |
| Convert numeric values | Better when: high accuracy and error detection are critical. Cons: prone to human error. | Better when: large volumes of data require fast, consistent processing. Cons: struggles with outliers. |
| Split complex descriptions | Better when: contextual understanding and quality control are needed. Cons: inconsistent interpretations. | Better when: efficiency and scalability are required. Cons: limited flexibility, depends on training data. |
| Working with unstructured texts | Better when: nuanced understanding and ambiguity resolution are important. Human curators traditionally excel at this task. | Traditional machine-learning approaches demonstrate poor quality. |
| Handle missing values | Better when: values are complex, variable, and can only be determined case by case. Cons: slow, subject to human error, not scalable. | Better when: efficient, consistent handling of large data volumes is needed. |
| Check for anomalies | Better when: expert judgment and contextual understanding are required. Cons: inconsistent handling at large scale. | Better when: rapid detection and consistent handling of anomalies in large datasets is needed. |

Applying LLMs for Curation

Recent advances in large language model (LLM) technology now make it possible to replicate the manual, highly customized decision-making process of curation at scale. LLMs are a type of artificial intelligence that generates human-like text based on the input it receives. They are trained on vast amounts of data, allowing them to understand and respond to a wide range of prompts.

LLMs generate text in response to a prompt. While LLMs are known to occasionally generate factually incorrect information (known as hallucinations), the specificity of the reply largely depends on how strong and focused the original prompt is. Another important aspect of a prompt is its context size, i.e., the amount of text it includes. Although LLMs have evolved to support larger context sizes, they still perform better when the task is relatively short and focused.
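Keeping prompts short and focused is easier when they are built from a template that covers one narrow task and a small batch of values at a time. The sketch below shows a hypothetical template for a term-mapping task; no particular LLM provider or API is assumed, and the rendered string would simply be sent to whichever model you use:

```python
# Hypothetical prompt template for one narrow curation task:
# mapping free-text terms to a target vocabulary.
PROMPT_TEMPLATE = """You are a biomedical data curator.
Map each input term to its preferred label in the target vocabulary.
Reply with one mapped term per line and nothing else.

Target vocabulary: {vocabulary}
Input terms:
{terms}
"""

def build_prompt(terms: list[str], vocabulary: str) -> str:
    """Render a compact prompt covering only a small batch of terms."""
    return PROMPT_TEMPLATE.format(vocabulary=vocabulary, terms="\n".join(terms))
```

Constraining the output format ("one mapped term per line and nothing else") also makes the model's reply easier to parse and validate automatically.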

While an in-depth introduction to LLMs (e.g., GPT, Claude, Llama, Gemini, Mixtral) is beyond the scope of this blog post, we will outline the most important DOs and DON'Ts of using LLMs for curation purposes.

[Figure: Applying LLMs for Curation — DON'T]

[Figure: Applying LLMs for Curation — DO]

As you can see, LLMs can complete most of the tasks traditionally attributed to manual curation. However, LLMs cannot curate relatively large data tables in a single run. Just like humans, they excel when a big task is split into smaller steps, focusing on one step at a time.
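Splitting a large table into manageable pieces can be as simple as the hypothetical helper below; each batch would then be sent to the model as its own short, focused task:

```python
def chunk_rows(rows: list, batch_size: int = 20):
    """Yield consecutive batches of at most `batch_size` rows from a table."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

The batch size is a tuning knob: small enough to keep each prompt well within the model's comfortable context, large enough to avoid an excessive number of calls.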

Conclusion

LLMs have the potential to revolutionize the curation process by automating tasks that were previously manual and time-consuming. By following the DOs and DON'Ts outlined in this post, you can effectively leverage LLMs to augment your curation workflows and achieve results at scale.

Remember to provide clear context, prepare your data, and use appropriate tools to ensure the best possible outcomes. And most importantly, always include human oversight to review and validate the LLM's output.

By incorporating LLMs into your curation process, you can harness the power of artificial intelligence to streamline your workflows, reduce manual effort, and ultimately achieve better results faster. Working with a data management system with data curation and governance solutions built into it can also ensure that your data is kept consistent and accurate, further empowering your LLMs.

Learn more about how Genestack can help to supercharge your LLMs for any research task at www.genestack.com/ai
