
funNLP Chinese/English NLP Toolkit and Resource Library

Set up and use fighting41love/funNLP, a curated collection of Chinese NLP resources with 80,000+ GitHub stars, covering sensitive word detection, language detection, datasets, and tools.

  1. Step 1

    What is funNLP?

    funNLP (fighting41love/funNLP) is one of the most comprehensive Chinese NLP resource collections on GitHub with over 80,000 stars. Unlike a traditional package, it's a curated repository that links to external NLP tools, datasets, and resources. It covers sensitive word detection, language detection, phone carrier lookup, name gender inference, email/ID extraction, Chinese/Japanese name databases, vocabulary sentiment values, stop words, and much more.

  2. Step 2

    Technology Stack

    The funNLP project collects resources across multiple technologies:

    Primary Language: Python (365 files in repository)

    Categories Covered:

    • Text Processing: jieba, HanLP, THULAC, LTP, NLTK, spaCy
    • Deep Learning Frameworks: PyTorch, TensorFlow, Keras
    • Pre-trained Models: BERT, RoBERTa, ALBERT, GPT-2, XLNet, ELECTRA, LLaMA
    • NLP Tasks: Named Entity Recognition (NER), Text Classification, Sentiment Analysis, Text Summarization, Machine Translation, Question Answering, Text Generation
    • Knowledge Graphs: Neo4j, AllegroGraph, Jena
    • OCR: cnocr, PaddleOCR, Tesseract
    • Speech Recognition: ASR datasets and tools
    • Visualization: Matplotlib, Seaborn, Plotly

    Data Formats: CSV, JSON, TXT, Pickle, HDF5, Parquet

    Core Technologies in funNLP:
    - Python 3.x (primary language)
    - Deep Learning: PyTorch, TensorFlow 2.x, Keras
    - NLP Libraries: jieba, HanLP, THULAC, StanfordNLP, spaCy
    - Transformers: HuggingFace transformers, tokenizers, datasets
    - Data Processing: pandas, numpy, scipy
    - Model Serving: FastAPI, Flask, Gradio
    - Visualization: matplotlib, seaborn, pyecharts
  3. Step 3

    Repository Structure

    funNLP is organized as a README-based resource catalog with supporting data files:

    funNLP/
    ├── README.md                    # Main catalog with categorized links
    ├── .github/                     # GitHub configuration
    └── data/                        # Supporting data files
        ├── .logo图片/               # Logo images
        ├── IT词库/                  # IT terminology dictionary
        ├── 中文分词词库整理/        # Chinese word segmentation dictionaries
        ├── 中文缩写库/             # Chinese abbreviation library
        ├── 中日文名字库/         # Chinese/English/Japanese name databases
        ├── 停用词/                  # Stop words
        ├── 公司名字词库/           # Company name dictionaries
        ├── 动物词库/               # Animal vocabularies
        ├── 医学词库/               # Medical terminology
        ├── 历史名人词库/           # Historical figures
        ├── 古诗词库/               # Ancient poetry
        ├── 同义词库、反义词库/     # Synonyms, antonyms
        ├── 地名词库/               # Place names
        ├── 成语词库/               # Chinese idioms
        ├── 法律词库/               # Legal terminology
        ├── 繁简体转换词库/       # Traditional/Simplified conversion
        ├── 职业词库/               # Occupation vocabulary
        ├── 财经词库/               # Financial terminology
        └── 食物词库/               # Food vocabulary
    

    The main README.md contains categorized links to external repositories, papers, and tools.
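
    To get a quick overview of what the data/ directory holds, the layout above can be walked programmatically. The following is a minimal sketch, assuming the repository was cloned into ./funNLP; the function name summarize_data_dir is illustrative and not part of funNLP.

```python
from pathlib import Path


def summarize_data_dir(data_dir):
    """Map each visible subdirectory of data_dir to its number of files.

    Hidden directories (names starting with '.') are skipped.
    """
    summary = {}
    for entry in sorted(Path(data_dir).iterdir()):
        if entry.is_dir() and not entry.name.startswith('.'):
            # Count regular files at any depth below this category directory.
            summary[entry.name] = sum(1 for p in entry.rglob('*') if p.is_file())
    return summary


if __name__ == '__main__':
    # Assumption: the repository was cloned into ./funNLP.
    data_dir = Path('funNLP/data')
    if data_dir.is_dir():
        for name, count in summarize_data_dir(data_dir).items():
            print(f'{name}: {count} files')
    else:
        print(f'{data_dir} not found; clone the repository first')
```

    Counting files per category is a cheap way to see which dictionaries are worth copying into a project before opening them one by one.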

  4. Step 4

    Usage Approach

    Since funNLP is a resource catalog rather than a package, you don't install it directly. Instead:

    Option 1: Clone for reference

    git clone https://github.com/fighting41love/funNLP.git
    cd funNLP
    

    Option 2: Browse online

    Visit https://github.com/fighting41love/funNLP to browse the categorized resource list.

    Option 3: Download specific dictionaries

    Fetch only the data files you need with git sparse-checkout, or download individual files from the data/ directory.

    # Clone the repository for offline reference
    git clone https://github.com/fighting41love/funNLP.git
    
    # Or use a partial, sparse clone to fetch only specific directories
    git clone --filter=blob:none --sparse https://github.com/fighting41love/funNLP.git
    cd funNLP
    # "set" replaces the sparse pattern list and updates the working tree,
    # so no separate "git checkout" is needed
    git sparse-checkout set "data/停用词" "data/中文分词词库整理"

    ⚠ Heads up: The repository is large (roughly 174 MB). Consider using sparse checkout if you only need specific dictionaries or data files.
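
    For the "download individual files" route in Option 3, directory names under data/ are Chinese, so the path has to be percent-encoded in a URL. Here is a small sketch of building such a URL. It assumes files are served from raw.githubusercontent.com and that the default branch is named master; both should be checked against the repository page. The helper raw_file_url and the example file name are hypothetical.

```python
from urllib.parse import quote

# Assumption: raw file host and repository slug; the branch name is passed in.
RAW_BASE = 'https://raw.githubusercontent.com/fighting41love/funNLP'


def raw_file_url(repo_path, branch='master'):
    """Build a download URL for one file, percent-encoding non-ASCII path parts."""
    # quote() encodes characters as UTF-8 and keeps '/' unescaped by default.
    return f'{RAW_BASE}/{branch}/{quote(repo_path)}'


if __name__ == '__main__':
    # Hypothetical file name; use one that exists in your target directory.
    print(raw_file_url('data/停用词/stopWords.txt'))
```

    The resulting URL can be passed to curl, wget, or urllib.request.urlretrieve to fetch a single dictionary without cloning anything.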
  5. Step 5

    Key Resource Categories

    The funNLP README organizes resources into major categories:

    Core NLP Tasks:

    • Text Classification (文本分类)
    • Named Entity Recognition / Information Extraction (命名实体识别/信息抽取)
    • Sentiment Analysis (情感分析)
    • Text Summarization (文本摘要)
    • Text Generation (文本生成)
    • Question Answering (智能问答)
    • Machine Translation (机器翻译)
    • Text Similarity/Matching (文本匹配)
    • Spelling Correction (文本纠错)

    LLM & ChatGPT Resources:

    • ChatGPT-like model benchmarks (类 ChatGPT 模型评测)
    • LLM training and inference (LLM 训练_推理)
    • Prompt Engineering (提示工程)
    • RAG/Dataset for LLMs (LLM 数据集)
    • Industry Applications (行业应用)

    Data Resources:

    • Corpora (语料库): Chinese/English training datasets
    • Dictionaries (词库): Stop words, sensitive words, specialized vocabularies
    • Knowledge Graphs (知识图谱): Neo4j tutorials, KG construction tools
    ## funNLP Major Categories (from README):
    
    ### LLM & ChatGPT Section
    - 类 ChatGPT 的模型评测对比 (Model benchmarks)
    - 类 ChatGPT 的资料 (Research papers)
    - 类 ChatGPT 的开源框架 (Open source frameworks)
    - LLM 的训练_推理_低资源_高效训练 (Training & inference)
    - 提示工程 (Prompt Engineering)
    - 类 ChatGPT 的文档问答 (RAG/DQA)
    - 多模态 LLM (Multimodal LLMs)
    
    ### Traditional NLP
    - 语料库 (Corpora)
    - 词库及词法工具 (Dictionaries & lexicon tools)
    - 预训练语言模型 (Pre-trained language models)
    - 抽取 (Information extraction)
    - 知识图谱 (Knowledge graphs)
    - 文本生成 (Text generation)
    - 文本摘要 (Text summarization)
    - 智能问答 (Question answering)
    - 文本纠错 (Spelling correction)
    - 文档处理 (Document processing)
    - 表格处理 (Table processing)
    - 文本匹配 (Text matching)
    - 文本分类 (Text classification)
    - 情感分析 (Sentiment analysis)
    - 机器翻译 (Machine translation)
    - 语音处理 (Speech processing)
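
    A category outline like the one above can be regenerated from a local copy of the README. The sketch below assumes the README marks its sections with #-style Markdown headings; if it uses HTML or tables for some sections, those will not be picked up. The function name extract_headings is illustrative.

```python
import re

HEADING_RE = re.compile(r'^(#{1,6})\s+(.+?)\s*$')
FENCE = '`' * 3  # marker that opens or closes a fenced code block


def extract_headings(markdown_text):
    """Return (level, title) pairs for #-style headings, skipping fenced code."""
    headings = []
    in_fence = False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue  # a '#' inside a code block is a comment, not a heading
        match = HEADING_RE.match(line)
        if match:
            headings.append((len(match.group(1)), match.group(2)))
    return headings


if __name__ == '__main__':
    from pathlib import Path
    readme = Path('funNLP/README.md')  # assumes a local clone in ./funNLP
    if readme.is_file():
        for level, title in extract_headings(readme.read_text(encoding='utf-8')):
            print('  ' * (level - 1) + title)
```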
  6. Step 6

    Accessing the Data Files

    The data/ directory contains actual downloadable resources:

    Chinese Stop Words (停用词)

    cd funNLP/data/停用词
    # Contains multiple stop word lists: stopWords.txt, CNStopWord.txt, etc.
    

    Chinese Name Databases

    cd funNLP/data/中日文名字库/
    # Contains: Chinese names, Japanese names, English names with gender predictions
    

    Specialized Dictionaries

    Medical, legal, financial, and other domain-specific vocabularies.

    # Example: Load stop words from the repository and filter segmented text
    import jieba  # pip install jieba
    
    # Path to one stop word list; adjust the file name to one present in data/停用词/
    stop_words_path = 'funNLP/data/停用词/Tencent.txt'
    
    with open(stop_words_path, 'r', encoding='utf-8') as f:
        stop_words = {line.strip() for line in f if line.strip()}
    
    print(f'Loaded {len(stop_words)} stop words')
    
    def remove_stop_words(text, stop_words):
        # Chinese text has no spaces between words, so segment it with jieba first.
        # (Calling text.split('') would raise ValueError: empty separator.)
        words = jieba.lcut(text)
        return ''.join(word for word in words if word not in stop_words)
    
    sample = "这是一个测试文本,包含一些停用词"
    print(remove_stop_words(sample, stop_words))
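
    Because the stop word directory holds several lists, it can be convenient to merge them into one set. The following is a rough sketch, assuming each file is UTF-8 text with one entry per line; load_stop_words and filter_tokens are illustrative helpers, not funNLP functions.

```python
from pathlib import Path


def load_stop_words(directory, pattern='*.txt'):
    """Merge every matching file in directory into one set of stop words."""
    stop_words = set()
    for path in sorted(Path(directory).glob(pattern)):
        # Assumption: UTF-8 files, one entry per line. 'utf-8-sig' also strips
        # a leading byte order mark; errors='ignore' skips undecodable bytes.
        with open(path, encoding='utf-8-sig', errors='ignore') as f:
            stop_words.update(line.strip() for line in f if line.strip())
    return stop_words


def filter_tokens(tokens, stop_words):
    """Keep the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


if __name__ == '__main__':
    directory = Path('funNLP/data/停用词')  # assumes a local clone in ./funNLP
    if directory.is_dir():
        print(f'Loaded {len(load_stop_words(directory))} unique stop words')
```

    Working on a token list rather than a raw string keeps the filter independent of whichever segmenter produced the tokens.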
  7. Step 7

    Common Use Cases

    Use Case 1: Content Moderation

    Use the sensitive word dictionaries to filter inappropriate content in Chinese text.

    Use Case 2: NLP Model Training

    Find pre-processed datasets and training corpora for various NLP tasks.

    Use Case 3: Resource Discovery

    Discover new NLP tools and libraries by browsing the categorized link list.

    Use Case 4: Dictionary Lookup

    Access specialized Chinese dictionaries for domain-specific NLP tasks (medical, legal, financial).

    # Use case: Load multiple dictionaries for domain-specific NLP
    import jieba  # pip install jieba
    
    # File names below are examples; adjust them to the files in your checkout
    medical_path = 'funNLP/data/医学词库/医学词库.txt'
    financial_path = 'funNLP/data/财经词库/财经词库.txt'
    
    def load_terms(path):
        # One term per line; blank lines are skipped
        with open(path, 'r', encoding='utf-8') as f:
            return [line.strip() for line in f if line.strip()]
    
    medical_terms = load_terms(medical_path)
    financial_terms = load_terms(financial_path)
    print(f'Loaded {len(medical_terms)} medical terms')
    print(f'Loaded {len(financial_terms)} financial terms')
    
    # Register the medical terms with jieba to improve segmentation
    jieba.load_userdict(medical_path)
    text = "患者出现头痛和发热症状"
    print(jieba.lcut(text))
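
    For Use Case 1 (content moderation), one simple approach is to mask every occurrence of a listed word. This is only a sketch of plain substring matching, not the method of any tool linked from funNLP; the word list in the example is made up, and in practice it would be loaded from one of the sensitive word files. Large lists are usually better served by a trie or Aho-Corasick matcher.

```python
import re


def mask_sensitive(text, sensitive_words, mask='*'):
    """Replace each occurrence of a listed word with mask characters."""
    words = [w for w in sensitive_words if w]
    if not words:
        return text
    # Try longer words first so overlapping entries mask the longest match.
    words.sort(key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(w) for w in words))
    return pattern.sub(lambda m: mask * len(m.group()), text)


if __name__ == '__main__':
    # Hypothetical word list for illustration only.
    print(mask_sensitive('这是一条垃圾广告', {'垃圾', '垃圾广告'}))
```

    Escaping each entry with re.escape keeps dictionary entries that contain punctuation from being interpreted as regular expression syntax.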
  8. Step 8

    Integrating with External Tools

    funNLP links to many external tools. Here are common integration patterns:

    Using linked tools (via the README links)

    1. Browse the README.md to find tools for your specific task
    2. Follow the link to the external repository
    3. Install and use those tools according to their documentation

    Example tools linked from funNLP:

    • jieba: Chinese word segmentation
    • HanLP: Complete NLP toolkit for Chinese
    • THULAC: Tsinghua University lexical analyzer for Chinese (word segmentation and POS tagging)
    • LTP: Language Technology Platform from Harbin Institute of Technology (HIT-SCIR)
    • pkuseg: Peking University NLP segmentation
    • SnowNLP: Simple Chinese text analysis
    # Install popular NLP tools mentioned in funNLP
    
    # Basic Chinese NLP tools
    pip install jieba        # Chinese word segmentation
    pip install snownlp      # Chinese sentiment analysis
    pip install pkuseg       # Peking University segmenter
    pip install thulac       # Tsinghua segmenter
    
    # Deep Learning NLP
    pip install transformers  # HuggingFace transformers
    pip install torch         # PyTorch
    pip install spacy         # Industrial-strength NLP
    pip install nltk          # Natural Language Toolkit
    
    # For model serving
    pip install fastapi       # Fast API for NLP services
    pip install gradio        # Easy NLP model demos
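
    After installing, it can help to confirm which of these tools are actually importable in the current environment. The snippet below is a minimal sketch using only the standard library; it assumes each package's import name matches the pip name shown above, and check_installed is an illustrative helper.

```python
from importlib.util import find_spec


def check_installed(module_names):
    """Map each top-level module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in module_names}


if __name__ == '__main__':
    # Top-level import names for the packages installed above.
    tools = ['jieba', 'snownlp', 'pkuseg', 'thulac',
             'transformers', 'torch', 'spacy', 'nltk', 'fastapi', 'gradio']
    for name, ok in check_installed(tools).items():
        print(f"{name:<14}{'installed' if ok else 'missing'}")
```

    find_spec only locates a module without importing it, so the check stays fast even for heavy packages such as torch.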
  9. Step 9

    Finding Specific Resources

    The funNLP README is organized with Markdown anchors. Use these search strategies:

    Search in README: Use grep or GitHub search to find specific entries.

    Browse data directories directly: List the data/ directory to find available dictionaries.

    # Search for specific resources in funNLP
    cd funNLP
    
    # Find sentiment analysis resources
    grep -i "情感分析" README.md | head -20
    
    # Find NER resources
    grep -i "命名实体" README.md | head -20
    
    # Find BERT resources
    grep -i "BERT" README.md | head -20
    
    # List all available data dictionaries
    ls -la data/
    
    # Read a specific dictionary file
    head -20 data/停用词/stopWords.txt
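
    The same keyword search can be done from Python, which is handy on systems without grep. This is a small sketch that mimics a case-insensitive grep with line numbers and a result cap; search_lines is an illustrative name, and the README path assumes a local clone in ./funNLP.

```python
def search_lines(text, keyword, limit=20):
    """Return up to limit (line_number, line) pairs that contain keyword.

    Matching is case-insensitive, similar to `grep -i -n` piped to `head`.
    """
    needle = keyword.casefold()
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(hits) >= limit:
            break
        if needle in line.casefold():
            hits.append((number, line.strip()))
    return hits


if __name__ == '__main__':
    from pathlib import Path
    readme = Path('funNLP/README.md')  # assumes a local clone in ./funNLP
    if readme.is_file():
        content = readme.read_text(encoding='utf-8')
        for number, line in search_lines(content, 'bert'):
            print(f'{number}: {line}')
```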
  10. Step 10

    Additional Information

    Repository Statistics:

    • Stars: 80,000+
    • Forks: 15,000+
    • Primary Language: Python
    • Created: August 2018
    • Last Updated: May 2024
    • Size: ~174 MB

    Author: fighting41love (GitHub)
    Homepage: https://zhuanlan.zhihu.com/yangyangfuture

    License: No explicit license declared - check individual resources for their own licensing.

    ⚠ Heads up: Each linked resource has its own licensing terms. Check licenses before using in commercial projects.
  11. Step 11

    Next Steps

    After exploring funNLP:

    1. Identify your NLP task: Browse the categorized links to find tools for your specific use case
    2. Install required tools: Use pip or follow README installation instructions for specific tools
    3. Download needed data: Copy relevant dictionaries from the data/ directory to your project
    4. Explore linked repositories: Follow links to in-depth resources and implementations
    5. Set up your development environment: Install Python 3.8+, required libraries, and model dependencies

    Related resources:

    • HuggingFace Hub for pre-trained models
    • Papers With Code for SOTA benchmarks
    • ArXiv for latest NLP research
    ## Quick Links from funNLP
    
    ### Essential Chinese NLP Tools
    - [jieba](https://github.com/fxsjy/jieba) - Chinese Word Segmentation
    - [HanLP](https://github.com/hankcs/HanLP) - Comprehensive Chinese NLP
    - [THULAC](https://github.com/thunlp/THULAC-Python) - Tsinghua Segmenter
    - [LTP](https://github.com/HIT-SCIR/ltp) - Harbin Institute of Technology LTP
    - [pkuseg](https://github.com/lancopku/pkuseg-python) - Peking University Segmenter
    - [SnowNLP](https://github.com/isnowfy/snownlp) - Chinese Text Analysis
    
    ### Pre-trained Models
    - [HuggingFace Transformers](https://github.com/huggingface/transformers)
    - [BERT](https://github.com/google-research/bert)
    - [RoBERTa](https://github.com/pytorch/fairseq)
    - [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm)
    
    ### Datasets & Corpora
    - [CLUE](https://github.com/CLUEbenchmark/CLUE) - Chinese NLP Benchmark
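    - [C3](https://github.com/nlpdata/c3) - Multiple-choice Chinese machine reading comprehension (also a CLUE task)
    - AFQMC - Ant Financial question matching (text semantic similarity), distributed as a [CLUE](https://github.com/CLUEbenchmark/CLUE) task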
