Source URL: https://simonwillison.net/2024/Nov/16/nuextract-15/#atom-everything
Source: Simon Willison’s Weblog
Title: NuExtract 1.5
Structured extraction – where an LLM helps turn unstructured text (or image content) into structured data – remains one of the most directly useful applications of LLMs.
NuExtract is a family of small models directly trained for this purpose, and released under the MIT license.
It comes in a variety of shapes and sizes:
NuExtract-v1.5 is a 3.8B parameter model fine-tuned on Phi-3.5-mini instruct. You can try this one out in this playground.
NuExtract-tiny-v1.5 is 494M parameters, fine-tuned on Qwen2.5-0.5B.
NuExtract-1.5-smol is 1.7B parameters, fine-tuned on SmolLM2-1.7B.
All three models were fine-tuned on NuMind’s “private high-quality dataset”. It’s interesting to see a model family that applies a single fine-tuning set to three completely different base models.
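Extraction models like these are typically driven by a JSON template describing the fields you want. As a minimal sketch of that pattern, here is a hedged example of building such a prompt; the `<|input|>`/`<|output|>` delimiters and section headers are an assumption based on NuMind's published model-card examples, not a guaranteed interface:

```python
import json

def build_prompt(template: dict, text: str) -> str:
    """Assemble a NuExtract-style prompt: a JSON template describing the
    desired output structure, followed by the unstructured input text.
    The delimiter format is an assumption based on model-card examples."""
    return (
        "<|input|>\n### Template:\n"
        + json.dumps(template, indent=4)
        + "\n### Text:\n"
        + text
        + "\n<|output|>\n"
    )

# Fields to extract, expressed as an empty-valued JSON template.
template = {"model_name": "", "parameters": "", "base_model": ""}
text = "NuExtract-tiny-v1.5 is 494M parameters, fine-tuned on Qwen2.5-0.5B."
prompt = build_prompt(template, text)

# To run the actual model locally (requires `pip install transformers torch`),
# something like the following should work; treat the exact arguments as a sketch:
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("numind/NuExtract-tiny-v1.5")
# model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-tiny-v1.5")
# out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=200)
# print(tok.decode(out[0]))  # should emit a JSON object matching the template
```

The key design point is that the schema lives entirely in the prompt, so the same fine-tuned model can serve many extraction tasks without retraining.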
Tags: llms, ai, generative-ai
AI Summary and Description: Yes
Summary: The text discusses NuExtract 1.5, an advanced tool for structured data extraction from unstructured formats using large language models (LLMs). It highlights the innovation in leveraging multiple base models fine-tuned on a shared dataset, making it relevant for professionals in AI and LLM security.
Detailed Description:
NuExtract 1.5 represents a significant advancement in the application of large language models (LLMs) for structured extraction from unstructured data, such as text and images. This utility offers professionals in the fields of AI security and information management new avenues for data processing and extraction efficiency.
– **Key Features of NuExtract 1.5**:
– **Purpose**: Designed to convert unstructured text and image content into structured data, a commonly sought-after capability in data analytics and compliance.
– **Model Variations**:
– **NuExtract-v1.5**: A 3.8 billion parameter model fine-tuned on Phi-3.5-mini instruct.
– **NuExtract-tiny-v1.5**: A smaller model at 494 million parameters, fine-tuned on Qwen2.5-0.5B, suitable for lightweight applications.
– **NuExtract-1.5-smol**: A medium-sized model with 1.7 billion parameters, fine-tuned on SmolLM2-1.7B.
– **Training Dataset**: All models were fine-tuned on NuMind’s “private high-quality dataset,” emphasizing the quality of data used in training, which is critical for ensuring better performance and reliability in data extraction tasks.
– **Innovation**: Applying one consistent fine-tuning dataset across several different base models makes it possible to compare how each base model handles the same task, and gives users a clear size/quality trade-off for different deployment constraints.
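Because extraction output is meant to be machine-consumed, it is worth validating that the model's JSON actually matches the requested template before using it downstream. The following is a minimal sketch; `validate_extraction` is a hypothetical helper, not part of NuExtract or any library:

```python
import json

def validate_extraction(raw_output: str, template: dict) -> dict:
    """Parse model output as JSON and check that it contains exactly the
    keys the template requested. Hypothetical helper for illustration."""
    data = json.loads(raw_output)  # raises ValueError on malformed JSON
    missing = set(template) - set(data)
    unexpected = set(data) - set(template)
    if missing or unexpected:
        raise ValueError(
            f"schema mismatch: missing={missing}, unexpected={unexpected}"
        )
    return data

template = {"model_name": "", "parameters": ""}
result = validate_extraction(
    '{"model_name": "NuExtract-v1.5", "parameters": "3.8B"}', template
)
```

A check like this turns silent extraction drift into an explicit error, which matters in the compliance-oriented settings the summary describes.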
This combination of model sizes, fine-tuning approach, and task-specific focus is particularly relevant for organizations looking to enhance their data processing capabilities while keeping information extraction accurate and within compliance parameters. These models may also warrant security scrutiny around data handling, model vulnerabilities, and privacy considerations.