
ConardLi / easy-dataset

A powerful tool for creating datasets for LLM fine-tuning, RAG, and evaluation

13,397 stars
1,331 forks
115 issues
Languages: JavaScript, HTML, Shell

Repository Summary (README)

<div align="center">

<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ConardLi/easy-dataset"> <img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/ConardLi/easy-dataset/total"> <img alt="GitHub Release" src="https://img.shields.io/github/v/release/ConardLi/easy-dataset"> <img src="https://img.shields.io/badge/license-AGPL--3.0-green.svg" alt="AGPL 3.0 License"/> <img alt="GitHub contributors" src="https://img.shields.io/github/contributors/ConardLi/easy-dataset"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/ConardLi/easy-dataset"> <a href="https://arxiv.org/abs/2507.04009v1" target="_blank"> <img src="https://img.shields.io/badge/arXiv-2507.04009-b31b1b.svg" alt="arXiv:2507.04009"> </a>

<a href="https://trendshift.io/repositories/13944" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13944" alt="ConardLi%2Feasy-dataset | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

A powerful tool for creating fine-tuning datasets for Large Language Models

简体中文 | English | Türkçe

Features • Quick Start • Documentation • Contributing • License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

</div>

Overview

Easy Dataset is an application for building large language model (LLM) datasets. It combines an intuitive interface with built-in document parsing, intelligent segmentation algorithms, and data cleaning and augmentation. The application converts domain-specific documents in various formats into high-quality structured datasets for model fine-tuning, retrieval-augmented generation (RAG), and model performance evaluation.

News

🎉🎉 Easy Dataset Version 1.7.0 launches brand-new evaluation capabilities! You can effortlessly convert domain-specific documents into evaluation datasets (test sets) and automatically run multi-dimensional evaluation tasks. Additionally, it comes with a human blind test system, enabling you to easily meet needs such as vertical domain model evaluation, post-fine-tuning model performance assessment, and RAG recall rate evaluation. Tutorial: https://www.bilibili.com/video/BV1CRrVB7Eb4/

Features

📄 Document Processing & Data Generation

  • Intelligent Document Processing: Supports PDF, Markdown, DOCX, TXT, EPUB and more formats with intelligent recognition
  • Intelligent Text Splitting: Multiple splitting algorithms (Markdown structure, recursive separators, fixed length, code-aware chunking), with customizable visual segmentation
  • Intelligent Question Generation: Auto-extract relevant questions from text segments, with question templates and batch generation
  • Domain Label Tree: Intelligently builds global domain label trees based on document structure, with auto-tagging capabilities
  • Answer Generation: Uses an LLM API to generate comprehensive answers and Chain of Thought (CoT) reasoning, with AI optimization
  • Data Cleaning: Intelligent text cleaning to remove noise and improve data quality
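
The "recursive separators" strategy listed above can be illustrated with a short sketch. This is not Easy Dataset's implementation: the function name `splitRecursive`, the separator order, and the `maxLen` limit are assumptions chosen for the example. The idea is to split on the coarsest separator first and fall back to finer ones only for pieces that are still too long.

```javascript
// Hypothetical sketch of recursive-separator splitting (not the project's code).
// Try the coarsest separator first; any piece still longer than maxLen is
// split again with the next, finer separator.
function splitRecursive(text, maxLen, separators = ["\n\n", "\n", ". ", " "]) {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];
  if (separators.length === 0) {
    // No separators left: fall back to fixed-length slices.
    const slices = [];
    for (let i = 0; i < text.length; i += maxLen) slices.push(text.slice(i, i + maxLen));
    return slices;
  }
  const [sep, ...rest] = separators;
  const chunks = [];
  let current = "";
  for (const piece of text.split(sep)) {
    const candidate = current ? current + sep + piece : piece;
    if (candidate.length <= maxLen) {
      current = candidate; // keep merging small pieces into one chunk
      continue;
    }
    if (current) chunks.push(current);
    if (piece.length <= maxLen) {
      current = piece;
    } else {
      // The piece alone is too long: recurse with the finer separators.
      chunks.push(...splitRecursive(piece, maxLen, rest));
      current = "";
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

const sample = "Intro paragraph.\n\nSecond paragraph that is a bit longer than the first one.";
console.log(splitRecursive(sample, 40));
```

Every returned chunk is at most `maxLen` characters long, and paragraph boundaries are preferred over sentence or word boundaries whenever the text allows it.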

🔄 Multiple Dataset Types

  • Single-Turn QA Datasets: Standard question-answer pairs for basic fine-tuning
  • Multi-Turn Dialogue Datasets: Customizable roles and scenarios for conversational format
  • Image QA Datasets: Generate visual QA data from images, with multiple import methods (directory, PDF, ZIP)
  • Data Distillation: Generate label trees and questions directly from domain topics without uploading documents

📊 Model Evaluation System

  • Evaluation Datasets: Generate true/false, single-choice, multiple-choice, short-answer, and open-ended questions
  • Automated Model Evaluation: Use a judge model to automatically evaluate the quality of model answers, with customizable scoring rules
  • Human Blind Test (Arena): Double-blind comparison of two models' answers for unbiased evaluation
  • AI Quality Assessment: Automatic quality scoring and filtering of generated datasets
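
The blind comparison described above can be sketched as follows. This is only an illustration under assumed names (`makeBlindPair`, `revealWinner`, and the slot labels "A" and "B" are hypothetical), not the project's Arena code. The two answers are shown in random order without model names, and the mapping back to models is consulted only after the evaluator has voted.

```javascript
// Hypothetical sketch of a blind pairing step (not the project's code).
// `random` is injectable so the ordering can be made deterministic in tests.
function makeBlindPair(answers, random = Math.random) {
  // answers: [{ model: "model-x", text: "..." }, { model: "model-y", text: "..." }]
  const order = random() < 0.5 ? [0, 1] : [1, 0];
  const slots = ["A", "B"];
  // What the evaluator sees: anonymous slots with answer text only.
  const shown = order.map((idx, i) => ({ slot: slots[i], text: answers[idx].text }));
  // The key maps anonymous slots back to model names; keep it hidden until the vote.
  const key = Object.fromEntries(order.map((idx, i) => [slots[i], answers[idx].model]));
  return { shown, key };
}

// Resolve the evaluator's vote ("A" or "B") to a model name after voting.
function revealWinner(key, votedSlot) {
  return key[votedSlot];
}

const pair = makeBlindPair([
  { model: "model-x", text: "Answer from the first model." },
  { model: "model-y", text: "Answer from the second model." },
]);
console.log(pair.shown);
console.log(revealWinner(pair.key, "A"));
```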

🛠️ Advanced Features

  • Custom Prompts: Project-level customization of all prompt templates (question generation, answer generation, data cleaning, etc.)
  • GA Pair Generation: Genre-Audience pair generation to enrich data diversity
  • Task Management Center: Background batch task processing with monitoring and interruption support
  • Resource Monitoring Dashboard: Token consumption statistics, API call tracking, model performance analysis
  • Model Testing Playground: Compare up to 3 models simultaneously

📤 Export & Integration

  • Multiple Export Formats: Alpaca, ShareGPT, and Multilingual-Thinking formats, exported as JSON or JSONL files
  • Balanced Export: Configure export counts per tag for dataset balancing
  • LLaMA Factory Integration: One-click LLaMA Factory configuration file generation
  • Hugging Face Upload: Upload datasets directly to the Hugging Face Hub
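
To make the export formats more concrete, the sketch below converts one question-answer pair into the commonly used Alpaca and ShareGPT record shapes and serializes the result as JSONL. The input field names (`question`, `answer`) and the helper names are assumptions for illustration; the exact fields Easy Dataset writes may differ.

```javascript
// Hypothetical helpers showing the common Alpaca and ShareGPT record shapes.
function toAlpaca(pair) {
  // Alpaca-style records use instruction / input / output.
  return { instruction: pair.question, input: "", output: pair.answer };
}

function toShareGPT(pair) {
  // ShareGPT-style records hold a list of turns tagged with "from" and "value".
  return {
    conversations: [
      { from: "human", value: pair.question },
      { from: "gpt", value: pair.answer },
    ],
  };
}

// JSONL is one JSON object per line; a JSON export would be a single array instead.
function toJsonl(records) {
  return records.map((r) => JSON.stringify(r)).join("\n");
}

const pairs = [{ question: "What is RAG?", answer: "Retrieval-augmented generation." }];
console.log(toJsonl(pairs.map(toAlpaca)));
console.log(toJsonl(pairs.map(toShareGPT)));
```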

🤖 Model Support

  • Wide Model Compatibility: Compatible with all LLM APIs that follow the OpenAI format
  • Multi-Provider Support: OpenAI, Ollama (local models), Zhipu AI, Alibaba Bailian, OpenRouter, and more
  • Vision Models: Supports vision-capable models such as Gemini and Claude for PDF parsing and image QA
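
"OpenAI format" usually refers to the chat completions request shape. The sketch below builds such a request without sending it, so it runs offline. The helper name `buildChatRequest`, the model name, and the API key are placeholders, and the base URL assumes a local Ollama server exposing its OpenAI-compatible endpoint; none of these values come from Easy Dataset itself.

```javascript
// Hypothetical helper: build (but do not send) an OpenAI-style chat request.
function buildChatRequest({ baseUrl, apiKey, model, prompt }) {
  return {
    // Strip trailing slashes so the path is joined cleanly.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}

const req = buildChatRequest({
  baseUrl: "http://localhost:11434/v1", // assumed local Ollama endpoint
  apiKey: "placeholder-key",            // assumed; local servers often ignore it
  model: "example-model",               // assumed model name
  prompt: "Generate one question about this passage: ...",
});
console.log(req.url);
// To send it from an async context: const res = await fetch(req.url, req.options);
```

Because only the base URL, key, and model name vary between providers, the same request shape can target any service that follows this format.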

🌐 User Experience

  • User-Friendly Interface: Modern, intuitive UI designed for both technical and non-technical users
  • Multi-Language Support: Complete Chinese, English, and Turkish language support 🇹🇷
  • Dataset Square: Discover and explore public dataset resources
  • Desktop Clients: Available for Windows, macOS, and Linux

Quick Demo

https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8

Local Run

Download Client

<table style="width: 100%"> <tr> <td width="20%" align="center"> <b>Windows</b> </td> <td width="30%" align="center" colspan="2"> <b>MacOS</b> </td> <td width="20%" align="center"> <b>Linux</b> </td> </tr> <tr style="text-align: center"> <td align="center" valign="middle"> <a href='https://github.com/ConardLi/easy-dataset/releases/latest'> <img src='./public/imgs/windows.png' style="height:24px; width: 24px" /> <br /> <b>Setup.exe</b> </a> </td> <td align="center" valign="middle"> <a href='https://github.com/ConardLi/easy-dataset/releases/latest'> <img src='./public/imgs/mac.png' style="height:24px; width: 24px" /> <br /> <b>Intel</b> </a> </td> <td align="center" valign="middle"> <a href='https://github.com/ConardLi/easy-dataset/releases/latest'> <img src='./public/imgs/mac.png' style="height:24px; width: 24px" /> <br /> <b>M</b> </a> </td> <td align="center" valign="middle"> <a href='https://github.com/ConardLi/easy-dataset/releases/latest'> <img src='./public/imgs/linux.png' style="height:24px; width: 24px" /> <br /> <b>AppImage</b> </a> </td> </tr> </table>

Install with NPM

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Install dependencies:
   npm install
  3. Build the application and start the server:
   npm run build
   npm run start
  4. Open your browser and visit http://localhost:1717

Using the Official Docker Image

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Modify the docker-compose.yml file:
   services:
     easy-dataset:
       image: ghcr.io/conardli/easy-dataset
       container_name: easy-dataset
       ports:
         - '1717:1717'
       volumes:
         - ./local-db:/app/local-db
         - ./prisma:/app/prisma
       restart: unless-stopped

Note: Use the local-db and prisma folders in the repository directory as mount paths so that the database paths match those used when starting via NPM.

Note: The database file is initialized automatically on first startup, so you do not need to run npm run db:push manually.

  3. Start with docker-compose:
   docker-compose up -d
  4. Open a browser and visit http://localhost:1717

Building with a Local Dockerfile

If you want to build the image yourself, use the Dockerfile in the project root directory:

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Build the Docker image:
   docker build -t easy-dataset .
  3. Run the container:
   docker run -d \
     -p 1717:1717 \
     -v ./local-db:/app/local-db \
     -v ./prisma:/app/prisma \
     --name easy-dataset \
     easy-dataset

Note: Use the local-db and prisma folders in the repository directory as mount paths so that the database paths match those used when starting via NPM.

Note: The database file is initialized automatically on first startup, so you do not need to run npm run db:push manually.

  4. Open a browser and visit http://localhost:1717

Documentation

Community Practice

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request (submit to the DEV branch)

Please update the tests as appropriate and follow the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Citation

If you find this work helpful, please cite it as:

@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}

<div align="center"> <sub>Built with ❤️ by <a href="https://github.com/ConardLi">ConardLi</a> • Follow me: <a href="./public/imgs/weichat.jpg">WeChat Official Account</a>|<a href="https://space.bilibili.com/474921808">Bilibili</a>|<a href="https://juejin.cn/user/3949101466785709">Juejin</a>|<a href="https://www.zhihu.com/people/wen-ti-chao-ji-duo-de-xiao-qi">Zhihu</a>|<a href="https://www.youtube.com/@garden-conard">Youtube</a></sub> </div>