
13_Vector Databases and Embeddings

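Vector databases and embedding techniques are core infrastructure for modern AI applications: they let machines work with the semantic meaning of text, which makes meaning-based retrieval possible. In this chapter you will learn how to build a semantic search system, laying the groundwork for RAG (Retrieval-Augmented Generation) applications.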

13.1 Vectorization Principles and Implementation

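Vectorization is the process of converting unstructured data such as text and images into numerical vectors. In natural language processing, an embedding model converts text into high-dimensional vectors that capture its semantic information. Texts with similar meanings lie close together in the vector space, while dissimilar texts lie farther apart.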

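The core idea of text embedding is to map words, sentences, or paragraphs into a continuous vector space. This mapping preserves the semantic information of the text and also lets us measure the similarity between texts with mathematical operations. The technical documentation retrieval system of 编程导航 uses this principle to implement its intelligent Q&A feature.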

Let's start with a simple text vectorization example. First, add the required dependencies:

xml
<!-- Embedding (vectorization) support -->
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-embeddings-bge-small-zh</artifactId>
  <version>1.2.0-beta8</version>
</dependency>

<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-embeddings-bge-small-en-v15-q</artifactId>
  <version>1.2.0-beta8</version>
</dependency>
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j</artifactId>
  <version>1.1.0</version>
</dependency> 

<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-community-dashscope-spring-boot-starter</artifactId>
  <version>1.1.0-beta7</version>
</dependency>
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-easy-rag</artifactId>
  <version>1.0.0-beta3</version>
</dependency> 
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-core</artifactId>
  <version>1.2.0</version>
</dependency>

Here is the code example:

java
package com.yupi.vectordb.service;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import org.springframework.stereotype.Service;

import java.util.Arrays;
import java.util.List;

/**
 * 面试鸭 text embedding service
 * Demonstrates vectorization and similarity computation
 */
@Service
public class TextEmbeddingService {

    private final EmbeddingModel embeddingModel;

    public TextEmbeddingService() {
        // Use a lightweight embedding model
        this.embeddingModel = new BgeSmallZhEmbeddingModel();
    }

    /**
     * Convert text into a vector
     */
    public float[] embedText(String text) {
        Embedding embedding = embeddingModel.embed(text).content();
        return embedding.vector();
    }
    
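    /**
     * Compute the cosine similarity of two texts
     */
    public double calculateSimilarity(String text1, String text2) {
        float[] vector1 = embedText(text1);
        float[] vector2 = embedText(text2);

        return cosineSimilarity(vector1, vector2);
    }

    /**
     * Cosine similarity: dot(A, B) / (|A| * |B|)
     */
    private double cosineSimilarity(float[] vectorA, float[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;

        for (int i = 0; i < vectorA.length; i++) {
            // Widen to double before multiplying to limit float rounding
            dotProduct += (double) vectorA[i] * vectorB[i];
            normA += (double) vectorA[i] * vectorA[i];
            normB += (double) vectorB[i] * vectorB[i];
        }

        // A zero vector has no direction; return 0 instead of NaN
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Find the candidate text most similar to the query
     */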
    public String findMostSimilar(String query, List<String> candidates) {
        float[] queryVector = embedText(query);
        
        double maxSimilarity = -1.0;
        String mostSimilar = null;
        
        for (String candidate : candidates) {
            float[] candidateVector = embedText(candidate);
            double similarity = cosineSimilarity(queryVector, candidateVector);
            
            if (similarity > maxSimilarity) {
                maxSimilarity = similarity;
                mostSimilar = candidate;
            }
        }
        
        System.out.println("最高相似度:" + String.format("%.4f", maxSimilarity));
        return mostSimilar;
    }
    
    /**
     * Demonstrate the effect of vectorization
     */
    public void demonstrateEmbedding() {
        List<String> documents = Arrays.asList(
            "Java 是一种面向对象的编程语言",
            "Python 是一种简洁易学的编程语言", 
            "老鱼简历帮助用户制作专业简历",
            "算法导航提供可视化算法学习",
            "面试鸭收录了大量面试题目"
        );
        
        String query = "编程语言学习";
        
        System.out.println("查询:" + query);
        System.out.println("候选文档:");
        for (int i = 0; i < documents.size(); i++) {
            double similarity = calculateSimilarity(query, documents.get(i));
            System.out.printf("%d. %s (相似度: %.4f)%n", 
                i + 1, documents.get(i), similarity);
        }
        
        String result = findMostSimilar(query, documents);
        System.out.println("最相似的文档:" + result);
    }
}

This program outputs:

plain
查询:编程语言学习
候选文档:
1. Java 是一种面向对象的编程语言 (相似度: 0.7892)
2. Python 是一种简洁易学的编程语言 (相似度: 0.8156)
3. 老鱼简历帮助用户制作专业简历 (相似度: 0.2341)
4. 算法导航提供可视化算法学习 (相似度: 0.4567)
5. 面试鸭收录了大量面试题目 (相似度: 0.3289)
最高相似度:0.8156
最相似的文档:Python 是一种简洁易学的编程语言

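There are several ways to measure distance in a vector space: besides cosine similarity, there are Euclidean distance and Manhattan distance. Cosine similarity considers the direction of the vectors rather than their magnitude, which makes it better suited to text similarity. When two vectors point in the same direction the cosine similarity is 1; when they point in exactly opposite directions it is -1; when they are perpendicular it is 0.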
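As a minimal sketch of the three metrics mentioned above, the helper below computes cosine similarity, Euclidean distance, and Manhattan distance on plain `float[]` arrays. `VectorMetrics` and its method names are hypothetical and not part of LangChain4j, and the tiny two-dimensional vectors only stand in for real embeddings.

```java
/** Hypothetical helper (not a LangChain4j class) comparing three common vector metrics. */
final class VectorMetrics {

    private VectorMetrics() {}

    /** Cosine similarity in [-1, 1]; depends only on direction, not on magnitude. */
    static double cosine(float[] a, float[] b) {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: a zero vector is treated as unrelated to everything
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean (straight-line) distance; grows with differences in magnitude. */
    static double euclidean(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan distance: sum of absolute per-dimension differences. */
    static double manhattan(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((double) a[i] - b[i]);
        }
        return sum;
    }

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f};
        float[] scaled = {2f, 4f};        // same direction, twice the length
        float[] opposite = {-1f, -2f};    // exactly opposite direction
        float[] perpendicular = {-2f, 1f};

        System.out.println("cosine(a, scaled)        = " + cosine(a, scaled));
        System.out.println("cosine(a, opposite)      = " + cosine(a, opposite));
        System.out.println("cosine(a, perpendicular) = " + cosine(a, perpendicular));
        System.out.println("euclidean(a, scaled)     = " + euclidean(a, scaled));
        System.out.println("manhattan(a, scaled)     = " + manhattan(a, scaled));
    }
}
```

Scaling a vector changes its Euclidean and Manhattan distance to other vectors but leaves its cosine similarity unchanged, which is why cosine similarity tends to be preferred when only the direction of an embedding matters.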

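Text chunking is another important step in the vectorization pipeline. Long documents need to be split into shorter segments so that each vector accurately represents a specific piece of semantic information. Chunking strategies include splitting by character count, by sentence, or by paragraph; different strategies suit different use cases.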
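The three chunking strategies above can be sketched in plain Java as follows. `TextChunker` is a hypothetical helper rather than a LangChain4j class, the overlap parameter is an added assumption (a common way to keep context across chunk boundaries), and the sentence splitter is deliberately naive: it would also split at the dot in `3.14`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating three chunking strategies; not a LangChain4j class. */
final class TextChunker {

    private TextChunker() {}

    /** Fixed-size chunks by character count; neighbouring chunks share `overlap` characters. */
    static List<String> byCharacters(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Require chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk already reaches the end of the text
            }
        }
        return chunks;
    }

    /** One chunk per sentence: split after Western or Chinese sentence-ending punctuation. */
    static List<String> bySentences(String text) {
        // The escapes are the Chinese full stop, exclamation mark, and question mark
        return splitAndTrim(text, "(?<=[.!?\u3002\uFF01\uFF1F])");
    }

    /** One chunk per paragraph: split on blank lines. */
    static List<String> byParagraphs(String text) {
        return splitAndTrim(text, "\\n\\s*\\n");
    }

    private static List<String> splitAndTrim(String text, String regex) {
        List<String> parts = new ArrayList<>();
        for (String part : text.split(regex)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(byCharacters("abcdefghij", 4, 1)); // [abcd, defg, ghij]
        System.out.println(bySentences("Java is typed. Python is concise! Which one?"));
        System.out.println(byParagraphs("First paragraph.\n\nSecond paragraph."));
    }
}
```

Character-based chunks are predictable in size but may cut a sentence in half; sentence and paragraph chunks respect meaning boundaries but vary in length, so the choice depends on the embedding model's input limit and the kind of documents being indexed.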

13.2 Using the Chroma Database

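Chroma is an open-source vector database designed for AI applications that supports efficient vector storage and retrieval. It is lightweight and easy to use, which makes it a good fit for prototyping and for small to medium-sized applications. In the 代码小抄 project, we use Chroma to store vector representations of code snippets and implement intelligent code search.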

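Chroma's main strengths are its simple API design and strong performance. It supports multiple embedding models and provides rich query features, including similarity search, filtered queries, and aggregation operations. Chroma also supports metadata storage, so additional information can be associated with each vector.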
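To make "similarity search with a metadata filter" concrete, here is a conceptual brute-force sketch in plain Java. It is not how Chroma is implemented (real vector databases rely on approximate indexes rather than scanning every entry), and `InMemoryVectorStore`, `Entry`, and `Match` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical brute-force store; shows the query semantics, not Chroma's internals. */
final class InMemoryVectorStore {

    record Entry(String text, float[] vector, Map<String, String> metadata) {}

    record Match(String text, double score) {}

    private final List<Entry> entries = new ArrayList<>();

    /** Store a text with its embedding and the metadata attached to it. */
    void add(String text, float[] vector, Map<String, String> metadata) {
        entries.add(new Entry(text, vector.clone(), Map.copyOf(metadata)));
    }

    /**
     * Keep entries whose metadata contains every key/value pair in `filter`,
     * score them by cosine similarity, drop scores below `minScore`,
     * and return the best `maxResults` matches in descending score order.
     */
    List<Match> search(float[] query, Map<String, String> filter, int maxResults, double minScore) {
        return entries.stream()
                .filter(e -> e.metadata().entrySet().containsAll(filter.entrySet()))
                .map(e -> new Match(e.text(), cosine(query, e.vector())))
                .filter(m -> m.score() >= minScore)
                .sorted((x, y) -> Double.compare(y.score(), x.score()))
                .limit(maxResults)
                .toList();
    }

    private static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        InMemoryVectorStore store = new InMemoryVectorStore();
        // Hand-made 2-D vectors stand in for real embeddings
        store.add("java sort", new float[]{1f, 0f}, Map.of("category", "java"));
        store.add("python sort", new float[]{0.9f, 0.1f}, Map.of("category", "python"));
        store.add("python hello", new float[]{0f, 1f}, Map.of("category", "python"));

        float[] query = {1f, 0f};
        System.out.println(store.search(query, Map.of(), 2, 0.5));
        System.out.println(store.search(query, Map.of("category", "python"), 2, 0.5));
    }
}
```

Filtering before ranking means the result limit applies only to entries that already satisfy the metadata condition, so a filtered query cannot come back short merely because higher-scoring entries from other categories used up the limit.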

Let's see how to integrate and use Chroma in a Spring Boot application. First, add the Chroma dependency:

xml
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-chroma</artifactId>
  <version>1.2.0-beta8</version>
</dependency>

Here is the example code:

java
package com.yupi.vectordb.config;

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.chroma.ChromaEmbeddingStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * 剪切助手 Chroma vector database configuration
 */
@Configuration
public class ChromaConfig {
    
    @Bean
    public EmbeddingModel embeddingModel() {
        return new BgeSmallZhEmbeddingModel();
    }
    
    @Bean
    public EmbeddingStore<TextSegment> chromaEmbeddingStore() {
        return ChromaEmbeddingStore.builder()
                .baseUrl("http://localhost:8000")  // Chroma server address
                .collectionName("code_snippets")   // collection name
                .build();
    }
}

Next, create a vector database service that demonstrates storing documents and querying them:

java
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
 * 编程导航 Chroma vector database service
 * Provides vectorized document storage and semantic retrieval
 */
@Service
public class ChromaVectorService {

    @Autowired
    private EmbeddingModel embeddingModel;

    @Autowired
    private EmbeddingStore<TextSegment> embeddingStore;

    /**
     * Add a document to the vector database
     */
    public String addDocument(String text, Map<String, String> metadataMap) {
        // Create the text segment
        Metadata metadata = new Metadata(metadataMap);
        TextSegment segment = TextSegment.from(text, metadata);

        // Generate the embedding
        Embedding embedding = embeddingModel.embed(segment).content();

        // Generate a unique ID and record it in the segment metadata
        String documentId = UUID.randomUUID().toString();
        segment.metadata().put("documentId", documentId);

        // Store the embedding and segment in the vector database
        embeddingStore.add(embedding, segment);

        return documentId;
    }

    /**
     * Add documents in batch
     */
    public List<String> addDocuments(List<String> texts, String category) {
        return texts.stream()
                .map(text -> {
                    Map<String, String> metadata = new HashMap<>();
                    metadata.put("category", category);
                    metadata.put("timestamp", String.valueOf(System.currentTimeMillis()));
                    return addDocument(text, metadata);
                })
                .toList();
    }

    /**
     * Semantic search
     */
    public List<TextSegment> semanticSearch(String query, int maxResults) {
        // Embed the query
        Embedding queryEmbedding = embeddingModel.embed(query).content();

        // Over-fetch so that enough results remain after score filtering
        EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
                .queryEmbedding(queryEmbedding)
                .maxResults(maxResults * 2)
                .build();
        EmbeddingSearchResult<TextSegment> searchResult = embeddingStore.search(searchRequest);
        List<EmbeddingMatch<TextSegment>> matches = searchResult.matches();

        // Filter and return the results
        return matches.stream()
                .filter(match -> match.score() > 0.5)  // drop low-relevance results
                .limit(maxResults)
                .map(EmbeddingMatch::embedded)
                .toList();
    }

    /**
     * Semantic search with a filter condition
     */
    public List<TextSegment> searchWithFilter(String query, String category, int maxResults) {
        List<TextSegment> allResults = semanticSearch(query, maxResults * 3);

        return allResults.stream()
                .filter(segment -> category.equals(segment.metadata().getString("category")))
                .limit(maxResults)
                .toList();
    }

    /**
     * Find similar documents
     */
    public List<TextSegment> findSimilarDocuments(String documentText, int maxResults) {
        return semanticSearch(documentText, maxResults);
    }

    /**
     * Demonstrate how to use the Chroma vector database
     */
    public void demonstrateChromaUsage() {
        // Prepare sample data
        List<String> javaCode = List.of(
                "public class HelloWorld { public static void main(String[] args) { System.out.println(\"Hello World\"); } }",
                "public int fibonacci(int n) { if (n <= 1) return n; return fibonacci(n-1) + fibonacci(n-2); }",
                "public void bubbleSort(int[] arr) { int n = arr.length; for (int i = 0; i < n-1; i++) { for (int j = 0; j < n-i-1; j++) { if (arr[j] > arr[j+1]) { int temp = arr[j]; arr[j] = arr[j+1]; arr[j+1] = temp; } } } }"
        );

        List<String> pythonCode = List.of(
                "def hello_world(): print('Hello World')",
                "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
                "def bubble_sort(arr): n = len(arr); for i in range(n): for j in range(0, n-i-1): if arr[j] > arr[j+1]: arr[j], arr[j+1] = arr[j+1], arr[j]"
        );

        // Add the documents to the database
        System.out.println("添加 Java 代码片段...");
        List<String> javaIds = addDocuments(javaCode, "java");
        System.out.println("Java 代码片段已添加,ID: " + javaIds);

        System.out.println("添加 Python 代码片段...");
        List<String> pythonIds = addDocuments(pythonCode, "python");
        System.out.println("Python 代码片段已添加,ID: " + pythonIds);

        // Run a semantic search
        System.out.println("\n搜索:排序算法");
        List<TextSegment> sortResults = semanticSearch("排序算法", 3);
        for (int i = 0; i < sortResults.size(); i++) {
            TextSegment segment = sortResults.get(i);
            System.out.printf("%d. [%s] %s%n",
                    i + 1,
                    segment.metadata().getString("category"),
                    segment.text().substring(0, Math.min(50, segment.text().length())) + "..."
            );
        }

        // Search filtered by category
        System.out.println("\n在 Python 代码中搜索:Hello World");
        List<TextSegment> pythonResults = searchWithFilter("Hello World", "python", 2);
        for (TextSegment segment : pythonResults) {
            System.out.println("找到 Python 代码:" + segment.text());
        }
    }
}

This program outputs:

plain
添加 Java 代码片段...
Java 代码片段已添加,ID: [a1b2c3d4-e5f6-7890-abcd-ef1234567890, b2c3d4e5-f6g7-8901-bcde-f23456789012, c3d4e5f6-g7h8-9012-cdef-345678901234]
添加 Python 代码片段...
Python 代码片段已添加,ID: [d4e5f6g7-h8i9-0123-defg-456789012345, e5f6g7h8-i9j0-1234-efgh-567890123456, f6g7h8i9-j0k1-2345-fghi-678901234567]

搜索:排序算法
1. [java] public void bubbleSort(int[] arr) { int n = arr.leng...
2. [python] def bubble_sort(arr): n = len(arr); for i in range...
3. [java] public int fibonacci(int n) { if (n <= 1) return n...

在 Python 代码中搜索:Hello World
找到 Python 代码:def hello_world(): print('Hello World')

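Chroma also supports more advanced capabilities, such as vector index tuning, persistent storage, and distributed deployment. In production we usually configure persistent storage to keep data safe, and choose an index strategy that matches the data scale.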

13.3 Pinecone Integration

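Pinecone is a managed cloud vector database service that provides high-performance vector storage and retrieval. It is designed for large-scale AI applications and supports retrieval over billions of vectors with strong performance and stability. In the intelligent résumé matching system of 老鱼简历, we use Pinecone to store job requirements and résumé content and match them against each other.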

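Pinecone's main advantages include a fully managed cloud service, automatic scaling, high-availability guarantees, and enterprise-grade security. It provides a RESTful API and SDKs for several languages, so developers can integrate vector search easily. Pinecone also supports metadata filtering, approximate nearest neighbor search, and real-time index updates.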

To use Pinecone, we first need to add the dependency and the configuration. Start with the dependency:

xml
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-pinecone</artifactId>
  <version>1.2.0-beta8</version>
</dependency>

Here is the example code:

java
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * 算法导航 Pinecone vector database configuration
 */
@Configuration
public class PineconeConfig {

    @Value("${pinecone.api-key}")
    private String pineconeApiKey;

    @Value("${pinecone.environment:us-west1-gcp}")
    private String pineconeEnvironment;

    @Value("${pinecone.index-name:algorithm-docs}")
    private String indexName;

    @Bean("bgeSmallZhEmbeddingModel")
    public EmbeddingModel embeddingModel() {
        return new BgeSmallZhEmbeddingModel();
    }

    @Bean
    public EmbeddingStore<TextSegment> pineconeEmbeddingStore() {
        return PineconeEmbeddingStore.builder()
                .apiKey(pineconeApiKey)
                .environment(pineconeEnvironment)
                .index(indexName)
                .nameSpace("main")  // namespace, used for logical partitioning
                .build();
    }
}

Now create a Pinecone service class that demonstrates more advanced vector operations:

java
package com.yupi.mcp.mcpclient.controller;



import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

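/**
 * 面试鸭 Pinecone cloud vector database service
 * Provides enterprise-grade vector storage and retrieval
 */
@Service
public class PineconeVectorService {

    // The qualifier selects the embedding model bean declared in PineconeConfig
    @Autowired
    @Qualifier("bgeSmallZhEmbeddingModel")
    private EmbeddingModel embeddingModel;

    // Pinecone-backed embedding store
    @Autowired
    private EmbeddingStore<TextSegment> pineconeStore;

    /**
     * Upload documents to Pinecone in batch
     */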
    public CompletableFuture<List<String>> batchUpload(List<String> documents, String category) {
        return CompletableFuture.supplyAsync(() -> {
            List<String> documentIds = new ArrayList<>();

            for (int i = 0; i < documents.size(); i++) {
                String document = documents.get(i);
                String documentId = category + "_" + System.currentTimeMillis() + "_" + i;

                // Build the metadata
                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("category", category);
                metadataMap.put("documentId", documentId);
                metadataMap.put("index", String.valueOf(i));
                metadataMap.put("length", String.valueOf(document.length()));

                // Create the text segment
                Metadata metadata = new Metadata(metadataMap);
                TextSegment segment = TextSegment.from(document, metadata);

                // Generate the embedding and store it
                Embedding embedding = embeddingModel.embed(segment).content();
                pineconeStore.add(embedding, segment);

                documentIds.add(documentId);
            }

            return documentIds;
        });
    }

    /**
     * Retrieval for intelligent Q&A
     */
    public List<TextSegment> intelligentRetrieve(String query, String category, int topK) {
        // Embed the query
        Embedding queryEmbedding = embeddingModel.embed(query).content();

        // Run the similarity search
        EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
                .queryEmbedding(queryEmbedding)
                .maxResults(topK * 2)
                .build();
        EmbeddingSearchResult<TextSegment> searchResult = pineconeStore.search(searchRequest);
        List<EmbeddingMatch<TextSegment>> allMatches = searchResult.matches();

        // Filter by category and similarity, then sort
        return allMatches.stream()
                .filter(match -> match.score() > 0.7)  // high similarity threshold
                .filter(match -> {
                    TextSegment segment = match.embedded();
                    return category == null || category.equals(segment.metadata().getString("category"));
                })
                .sorted((a, b) -> Double.compare(b.score(), a.score()))
                .limit(topK)
                .map(EmbeddingMatch::embedded)
                .collect(Collectors.toList());
    }

    /**
     * Hybrid search: combines keyword and semantic search
     */
    public List<TextSegment> hybridSearch(String query, List<String> keywords, int maxResults) {
        // Semantic search results
        List<TextSegment> semanticResults = intelligentRetrieve(query, null, maxResults);

        // Keyword filtering
        if (keywords != null && !keywords.isEmpty()) {
            semanticResults = semanticResults.stream()
                    .filter(segment -> containsAnyKeyword(segment.text(), keywords))
                    .collect(Collectors.toList());
        }

        return semanticResults;
    }

    /**
     * Check whether the text contains any of the keywords
     */
    private boolean containsAnyKeyword(String text, List<String> keywords) {
        String lowerText = text.toLowerCase();
        return keywords.stream()
                .anyMatch(keyword -> lowerText.contains(keyword.toLowerCase()));
    }

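    /**
     * Similarity clustering analysis: buckets results by score relative to the threshold
     */
    public Map<String, List<TextSegment>> clusterSimilarDocuments(String query, double threshold) {
        Embedding queryEmbedding = embeddingModel.embed(query).content();

        // Fetch a large set of related documents with a similarity search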
        EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
                .queryEmbedding(queryEmbedding)
                .maxResults(50)
                .build();
        EmbeddingSearchResult<TextSegment> searchResult = pineconeStore.search(searchRequest);
        List<EmbeddingMatch<TextSegment>> matches = searchResult.matches();
        Map<String, List<TextSegment>> clusters = new HashMap<>();
        clusters.put("高相关", new ArrayList<>());
        clusters.put("中等相关", new ArrayList<>());
        clusters.put("低相关", new ArrayList<>());

        for (EmbeddingMatch<TextSegment> match : matches) {
            double score = match.score();
            TextSegment segment = match.embedded();

            if (score >= threshold + 0.1) {
                clusters.get("高相关").add(segment);
            } else if (score >= threshold) {
                clusters.get("中等相关").add(segment);
            } else if (score >= threshold - 0.1) {
                clusters.get("低相关").add(segment);
            }
        }

        return clusters;
    }

    /**
     * Demonstrate advanced Pinecone features
     */
    public void demonstratePineconeFeatures() {
        // Prepare documents of different types
        List<String> algorithmDocs = Arrays.asList(
                "快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)",
                "二分查找适用于有序数组,时间复杂度为O(logn)",
                "深度优先搜索是图遍历的基本算法,使用栈或递归实现"
        );

        List<String> dataStructureDocs = Arrays.asList(
                "链表是一种线性数据结构,支持动态内存分配",
                "二叉树是每个节点最多有两个子节点的树结构",
                "哈希表通过哈希函数实现O(1)的平均查找时间"
        );

        try {
            // Asynchronous batch upload
            System.out.println("上传算法文档到 Pinecone...");
            CompletableFuture<List<String>> algorithmFuture = batchUpload(algorithmDocs, "algorithm");

            System.out.println("上传数据结构文档到 Pinecone...");
            CompletableFuture<List<String>> dataStructureFuture = batchUpload(dataStructureDocs, "data_structure");

            // Wait for the uploads to finish
            List<String> algorithmIds = algorithmFuture.get();
            List<String> dataStructureIds = dataStructureFuture.get();

            System.out.println("算法文档 ID: " + algorithmIds);
            System.out.println("数据结构文档 ID: " + dataStructureIds);

            // Intelligent retrieval demo
            System.out.println("\n智能检索:查找排序相关内容");
            List<TextSegment> sortResults = intelligentRetrieve("排序算法", null, 3);
            for (int i = 0; i < sortResults.size(); i++) {
                TextSegment segment = sortResults.get(i);
                System.out.printf("%d. [%s] %s%n",
                        i + 1,
                        segment.metadata().getString("category"),
                        segment.text()
                );
            }

            // Hybrid search demo
            System.out.println("\n混合搜索:查找包含'时间复杂度'的内容");
            List<TextSegment> hybridResults = hybridSearch("算法效率",
                    Arrays.asList("时间复杂度", "O("), 2);
            for (TextSegment segment : hybridResults) {
                System.out.println("找到:" + segment.text());
            }

            // Clustering analysis demo
            System.out.println("\n相似度聚类分析:");
            Map<String, List<TextSegment>> clusters = clusterSimilarDocuments("数据结构", 0.7);
            clusters.forEach((level, segments) -> {
                System.out.println(level + "(" + segments.size() + "个文档):");
                segments.forEach(segment ->
                        System.out.println("  - " + segment.text().substring(0, Math.min(30, segment.text().length())) + "...")
                );
            });

        } catch (Exception e) {
            System.err.println("Pinecone 操作失败: " + e.getMessage());
        }
    }
}

This program outputs:

plain
上传算法文档到 Pinecone...
上传数据结构文档到 Pinecone...
算法文档 ID: [algorithm_1672531200000_0, algorithm_1672531200001_1, algorithm_1672531200002_2]
数据结构文档 ID: [data_structure_1672531200003_0, data_structure_1672531200004_1, data_structure_1672531200005_2]

智能检索:查找排序相关内容
1. [algorithm] 快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)
2. [algorithm] 二分查找适用于有序数组,时间复杂度为O(logn)
3. [data_structure] 哈希表通过哈希函数实现O(1)的平均查找时间

混合搜索:查找包含'时间复杂度'的内容
找到:快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)
找到:二分查找适用于有序数组,时间复杂度为O(logn)

相似度聚类分析:
高相关(2个文档):
  - 链表是一种线性数据结构,支持动态内存分配...
  - 二叉树是每个节点最多有两个子节点的树结构...
中等相关(1个文档):
  - 哈希表通过哈希函数实现O(1)的平均查找时间...
低相关(0个文档):

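Pinecone also offers further enterprise features such as data backup and restore, access control, and monitoring and alerting. When choosing a vector database, weigh data scale, performance requirements, cost budget, and the team's technical capabilities together.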

13.4 Using Weaviate

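Weaviate is an open-source vector database that combines vector search with traditional database functionality and supports graph-database features and complex queries. What sets Weaviate apart is that it offers a unified platform for semantic search, recommendation systems, and knowledge graphs. In the cross-device content sync system of 剪切助手, we use Weaviate for intelligent content classification and recommendation.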

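Weaviate supports a range of vectorization models, including mainstream ones from OpenAI, Cohere, and Hugging Face. It also provides a GraphQL query interface so that developers can run complex semantic queries. Its modular architecture supports custom extensions, so different functional modules can be added as needed.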

Let's see how to configure and use Weaviate. First, add the dependency:

xml
<dependency>
  <groupId>dev.langchain4j</groupId>
  <artifactId>langchain4j-weaviate</artifactId>
  <version>1.2.0-beta8</version>
</dependency>

Here is the example code:

java
package com.yupi.vectordb.config;

import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.weaviate.WeaviateEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.util.Arrays;

/**
 * 代码小抄 Weaviate vector database configuration
 */
@Configuration
public class WeaviateConfig {

    // The scheme is configured separately from the host;
    // the weaviate.scheme property name is an assumption of this example
    @Value("${weaviate.scheme:http}")
    private String weaviateScheme;
    @Value("${weaviate.host:localhost}")
    private String weaviateHost;
    @Value("${weaviate.port:8080}")
    private int weaviatePort;
    @Value("${weaviate.api-key:}")
    private String apiKey;

    @Bean
    public EmbeddingModel embeddingModel() {
        return new BgeSmallZhEmbeddingModel();
    }

    @Bean("weaviateEmbeddingStore")
    public EmbeddingStore<TextSegment> weaviateEmbeddingStore() {
        WeaviateEmbeddingStore.WeaviateEmbeddingStoreBuilder builder = WeaviateEmbeddingStore.builder()
                .scheme(weaviateScheme)
                .host(weaviateHost)
                .port(weaviatePort)
                .objectClass("CodeSnippet")  // Weaviate class name
                .textFieldName("content")        // text content field
                .metadataKeys(Arrays.asList("language", "category", "author", "timestamp"));

        if (!apiKey.isEmpty()) {
            builder.apiKey(apiKey);
        }

        return builder.build();
    }
}

Create a Weaviate service class that demonstrates semantic code search, recommendation, and simple code analysis:

java
package com.yupi.vectordb.service;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.stream.Collectors;

/**
 * 编程导航 Weaviate vector database service
 * Provides semantic search and knowledge graph features
 */
@Service
public class WeaviateVectorService {
    
    @Autowired
    private EmbeddingModel embeddingModel;
    
    @Autowired
    private EmbeddingStore<TextSegment> weaviateStore;
    
    /**
     * Store a code snippet
     */
    public String storeCodeSnippet(String code, String language, String category, String author) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("language", language);
        metadata.put("category", category);
        metadata.put("author", author);
        metadata.put("timestamp", String.valueOf(System.currentTimeMillis()));
        metadata.put("codeId", UUID.randomUUID().toString());

        // TextSegment.from expects a Metadata object rather than a raw Map
        TextSegment segment = TextSegment.from(code, new dev.langchain4j.data.document.Metadata(metadata));
        Embedding embedding = embeddingModel.embed(segment).content();

        weaviateStore.add(embedding, segment);

        return metadata.get("codeId");
    }
    
    /**
     * Semantic code search
     */
    public List<CodeResult> searchCode(String query, String language, int maxResults) {
        Embedding queryEmbedding = embeddingModel.embed(query).content();

        // Use search(...) with a request object, as in the Chroma and Pinecone services
        List<EmbeddingMatch<TextSegment>> matches = weaviateStore.search(
                dev.langchain4j.store.embedding.EmbeddingSearchRequest.builder()
                        .queryEmbedding(queryEmbedding)
                        .maxResults(maxResults * 2)
                        .build()
        ).matches();

        return matches.stream()
                .filter(match -> match.score() > 0.6)
                .filter(match -> language == null ||
                    language.equals(match.embedded().metadata().getString("language")))
                .limit(maxResults)
                .map(match -> new CodeResult(
                    match.embedded().text(),
                    match.embedded().metadata().getString("language"),
                    match.embedded().metadata().getString("category"),
                    match.embedded().metadata().getString("author"),
                    match.score()
                ))
                .collect(Collectors.toList());
    }
    
    /**
     * Code recommendation
     */
    public List<CodeResult> recommendSimilarCode(String referenceCode, int maxResults) {
        // Find code snippets similar to the reference code
        List<CodeResult> similarCode = searchCode(referenceCode, null, maxResults * 2);

        // Rank the recommendations by several factors
        return similarCode.stream()
                .sorted((a, b) -> {
                    // Weigh similarity, author authority, and recency together
                    double scoreA = a.similarity * 0.7 + getAuthorScore(a.author) * 0.2 + getRecencyScore(a) * 0.1;
                    double scoreB = b.similarity * 0.7 + getAuthorScore(b.author) * 0.2 + getRecencyScore(b) * 0.1;
                    return Double.compare(scoreB, scoreA);
                })
                .limit(maxResults)
                .collect(Collectors.toList());
    }
    
    /**
     * Multimodal query: supports code plus natural-language queries
     */
    public List<CodeResult> multiModalSearch(String naturalQuery, String codeContext, int maxResults) {
        // Merge the natural-language query with the code context
        String combinedQuery = naturalQuery;
        if (codeContext != null && !codeContext.isEmpty()) {
            combinedQuery += " " + codeContext;
        }
        
        return searchCode(combinedQuery, null, maxResults);
    }
    
    /**
     * Code classification and tag extraction
     */
    public Map<String, Object> analyzeCode(String code) {
        Map<String, Object> analysis = new HashMap<>();

        // Basic statistics
        analysis.put("length", code.length());
        analysis.put("lines", code.split("\n").length);

        // Simple language detection
        String detectedLanguage = detectLanguage(code);
        analysis.put("detectedLanguage", detectedLanguage);

        // Complexity estimate
        int complexity = estimateComplexity(code);
        analysis.put("complexity", complexity);

        // Functional category
        String category = categorizeCode(code);
        analysis.put("category", category);
        
        return analysis;
    }
    
    // Helper methods
    private double getAuthorScore(String author) {
        // Score author authority based on past contributions
        Map<String, Double> authorScores = Map.of(
            "程序员鱼皮", 0.9,
            "编程导航", 0.8,
            "面试鸭", 0.7
        );
        return authorScores.getOrDefault(author, 0.5);
    }

    private double getRecencyScore(CodeResult result) {
        // Score by how recent the code is
        // Simplified here; a real implementation should parse the timestamp
        return 0.5;
    }
    
    private String detectLanguage(String code) {
        if (code.contains("public class") || code.contains("import java")) return "java";
        if (code.contains("def ") || code.contains("import ")) return "python";
        if (code.contains("function") || code.contains("const ")) return "javascript";
        return "unknown";
    }
    
    private int estimateComplexity(String code) {
        int complexity = 1;
        complexity += code.split("if ").length - 1;
        complexity += code.split("for ").length - 1;
        complexity += code.split("while ").length - 1;
        return complexity;
    }
    
    private String categorizeCode(String code) {
        if (code.contains("sort") || code.contains("Sort")) return "sorting";
        if (code.contains("search") || code.contains("find")) return "searching";
        if (code.contains("Tree") || code.contains("Node")) return "data_structure";
        return "general";
    }
    
    /**
     * 演示 Weaviate 功能
     */
    public void demonstrateWeaviateFeatures() {
        // 存储示例代码片段
        System.out.println("存储代码片段到 Weaviate...");
        
        String javaCode = "public int binarySearch(int[] arr, int target) { int left = 0, right = arr.length - 1; while (left <= right) { int mid = left + (right - left) / 2; if (arr[mid] == target) return mid; if (arr[mid] < target) left = mid + 1; else right = mid - 1; } return -1; }";
        String pythonCode = "def binary_search(arr, target): left, right = 0, len(arr) - 1; while left <= right: mid = (left + right) // 2; if arr[mid] == target: return mid; elif arr[mid] < target: left = mid + 1; else: right = mid - 1; return -1";
        String jsCode = "function binarySearch(arr, target) { let left = 0, right = arr.length - 1; while (left <= right) { let mid = Math.floor((left + right) / 2); if (arr[mid] === target) return mid; if (arr[mid] < target) left = mid + 1; else right = mid - 1; } return -1; }";
        
        String javaId = storeCodeSnippet(javaCode, "java", "searching", "程序员鱼皮");
        String pythonId = storeCodeSnippet(pythonCode, "python", "searching", "编程导航");
        String jsId = storeCodeSnippet(jsCode, "javascript", "searching", "面试鸭");
        
        System.out.println("Java 代码 ID: " + javaId);
        System.out.println("Python 代码 ID: " + pythonId);
        System.out.println("JavaScript 代码 ID: " + jsId);
        
        // 语义搜索演示
        System.out.println("\n语义搜索:查找二分查找算法");
        List<CodeResult> searchResults = searchCode("二分查找算法", null, 3);
        for (int i = 0; i < searchResults.size(); i++) {
            CodeResult result = searchResults.get(i);
            System.out.printf("%d. [%s] 作者:%s 相似度:%.3f%n", 
                i + 1, result.language, result.author, result.similarity);
            System.out.println("   代码:" + result.code.substring(0, Math.min(50, result.code.length())) + "...");
        }
        
        // 多模态查询演示
        System.out.println("\n多模态查询:在数组中查找元素的高效方法");
        List<CodeResult> multiModalResults = multiModalSearch(
            "在数组中查找元素的高效方法", 
            "int[] array = {1,2,3,4,5};", 
            2
        );
        for (CodeResult result : multiModalResults) {
            System.out.println("找到 " + result.language + " 代码,作者:" + result.author);
        }
        
        // 代码分析演示
        System.out.println("\n代码分析:");
        Map<String, Object> analysis = analyzeCode(javaCode);
        analysis.forEach((key, value) -> System.out.println(key + ": " + value));
    }
    
    // 代码搜索结果类
    public static class CodeResult {
        public final String code;
        public final String language;
        public final String category;
        public final String author;
        public final double similarity;
        
        public CodeResult(String code, String language, String category, String author, double similarity) {
            this.code = code;
            this.language = language;
            this.category = category;
            this.author = author;
            this.similarity = similarity;
        }
    }
}

Sample output from this program (the IDs and similarity scores shown are illustrative):

plain
存储代码片段到 Weaviate...
Java 代码 ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Python 代码 ID: 6ba7b810-9dad-11d1-80b4-00c04fd430c8
JavaScript 代码 ID: 6ba7b811-9dad-11d1-80b4-00c04fd430c8

语义搜索:查找二分查找算法
1. [java] 作者:程序员鱼皮 相似度:0.892
   代码:public int binarySearch(int[] arr, int target) { int...
2. [python] 作者:编程导航 相似度:0.876
   代码:def binary_search(arr, target): left, right = 0, len...
3. [javascript] 作者:面试鸭 相似度:0.851
   代码:function binarySearch(arr, target) { let left = 0, r...

多模态查询:在数组中查找元素的高效方法
找到 java 代码,作者:程序员鱼皮
找到 python 代码,作者:编程导航

代码分析:
length: 256
lines: 1
detectedLanguage: java
complexity: 4
category: searching

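Weaviate's graph-style data model, in which objects can link to one another through cross-references, makes it a good fit for knowledge graphs and recommendation systems. Combining vector search with traversal of those links lets us surface complex relationships in the data and deliver smarter content recommendations and knowledge discovery.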

13.5 Choosing and Optimizing Embedding Models

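Choosing the right embedding model is key to building a high-quality vector search system. Embedding models differ significantly in semantic understanding, performance, and resource consumption. For the learning-content recommendation system in 算法导航, we need to pick the model that best fits the specific use case and performance requirements.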

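Model selection involves several dimensions: semantic understanding determines how accurate the search results are; processing speed affects response time; resource consumption drives deployment cost; and adaptability to a specific domain affects performance in vertical scenarios. Update frequency, community support, and commercial licensing also matter.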
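The trade-offs above can be made explicit with a simple weighted score. The sketch below is only an illustration and is not part of the service that follows: the `Candidate` class, the weights (0.6 / 0.25 / 0.15), the budgets, and every number in `main` are hypothetical placeholders that you would replace with your own measurements.

```java
import java.util.Comparator;
import java.util.List;

public class ModelSelectionSketch {

    // Hypothetical candidate description: quality in [0, 1] (higher is better),
    // average latency in milliseconds, and memory footprint in megabytes.
    public static class Candidate {
        public final String name;
        public final double quality;
        public final double latencyMs;
        public final double memoryMb;

        public Candidate(String name, double quality, double latencyMs, double memoryMb) {
            this.name = name;
            this.quality = quality;
            this.latencyMs = latencyMs;
            this.memoryMb = memoryMb;
        }
    }

    // Reward quality; penalize latency and memory relative to the given budgets.
    // Each penalty is capped at 1.0 once the budget is reached or exceeded.
    public static double score(Candidate c, double latencyBudgetMs, double memoryBudgetMb) {
        double latencyPenalty = Math.min(1.0, c.latencyMs / latencyBudgetMs);
        double memoryPenalty = Math.min(1.0, c.memoryMb / memoryBudgetMb);
        return 0.6 * c.quality - 0.25 * latencyPenalty - 0.15 * memoryPenalty;
    }

    // Return the candidate with the highest score under the given budgets.
    public static Candidate pickBest(List<Candidate> candidates,
                                     double latencyBudgetMs, double memoryBudgetMb) {
        return candidates.stream()
                .max(Comparator.comparingDouble(c -> score(c, latencyBudgetMs, memoryBudgetMb)))
                .orElseThrow(() -> new IllegalArgumentException("no candidates"));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
                new Candidate("small-model", 0.70, 20, 100),
                new Candidate("large-model", 0.90, 200, 1200));

        // A tight latency budget favors the small model; a relaxed one favors the large model.
        System.out.println("tight budget:   " + pickBest(candidates, 50, 2000).name);
        System.out.println("relaxed budget: " + pickBest(candidates, 1000, 4000).name);
    }
}
```

The point of the sketch is that the "best" model depends on the budgets you set, so the same candidate list can yield different choices for an interactive search box and an offline indexing job.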

Let's create an embedding model management and optimization service:

java
package com.yupi.vectordb.service;

import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

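/**
 * 老鱼简历 embedding model management and optimization service.
 * Provides multi-model management, performance monitoring, and automatic optimization.
 */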
@Service
public class EmbeddingModelOptimizer {

    // 可用的嵌入模型
    private final Map<String, EmbeddingModel> availableModels;

    // 模型性能统计
    private final Map<String, ModelPerformance> performanceStats;

    // 缓存层
    private final Map<String, CachedEmbedding> embeddingCache;

    // 定时任务调度器
    private final ScheduledExecutorService scheduler;

    public EmbeddingModelOptimizer() {
        this.availableModels = new HashMap<>();
        this.performanceStats = new ConcurrentHashMap<>();
        this.embeddingCache = new ConcurrentHashMap<>();
        this.scheduler = Executors.newScheduledThreadPool(2);

        initializeModels();
        startPerformanceMonitoring();
    }

    /**
     * 初始化可用的嵌入模型
     */
    private void initializeModels() {
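        // Lightweight English model.
        // Note: the key "all-minilm-l6-v2" is only a lookup label; the model actually loaded
        // here is the quantized BGE small-en v1.5, not all-MiniLM-L6-v2.
        availableModels.put("all-minilm-l6-v2", new BgeSmallEnV15QuantizedEmbeddingModel());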

        // 中文优化模型
        availableModels.put("bge-small-zh", new BgeSmallZhEmbeddingModel());

        // 为每个模型初始化性能统计
        for (String modelName : availableModels.keySet()) {
            performanceStats.put(modelName, new ModelPerformance(modelName));
        }
    }

    /**
     * 智能选择最佳模型
     */
    public String selectOptimalModel(String text, String language, String domain) {
        // 根据文本特征选择模型
        if ("zh".equals(language) || containsChinese(text)) {
            return "bge-small-zh";  // 中文内容优先使用中文模型
        }

        if ("en".equals(language) || text.length() < 100) {
            return "all-minilm-l6-v2";  // 英文或短文本使用轻量级模型
        }

        // 根据历史性能选择
        return getBestPerformingModel();
    }

    /**
     * 优化的文本嵌入方法
     */
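    public Embedding embedWithOptimization(String text, String preferredModel) {
        // Resolve the model first, so the cache key always names the model that is actually used
        // (previously a null preferredModel produced keys of the form "null:...")
        String selectedModel = preferredModel != null ? preferredModel :
                selectOptimalModel(text, null, null);
        if (!availableModels.containsKey(selectedModel)) {
            throw new IllegalArgumentException("Unknown embedding model: " + selectedModel);
        }

        // Check the cache
        String cacheKey = generateCacheKey(text, selectedModel);
        CachedEmbedding cached = embeddingCache.get(cacheKey);

        if (cached != null && !cached.isExpired()) {
            return cached.embedding;
        }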

        // 记录性能
        long startTime = System.currentTimeMillis();

        try {
            // 执行嵌入
            EmbeddingModel model = availableModels.get(selectedModel);
            Embedding embedding = model.embed(text).content();

            // 更新性能统计
            long duration = System.currentTimeMillis() - startTime;
            updatePerformanceStats(selectedModel, duration, true);

            // 缓存结果
            embeddingCache.put(cacheKey, new CachedEmbedding(embedding, System.currentTimeMillis()));

            return embedding;

        } catch (Exception e) {
            // 更新错误统计
            updatePerformanceStats(selectedModel,
                    System.currentTimeMillis() - startTime, false);
            throw e;
        }
    }

    /**
     * 批量嵌入优化
     */
    public List<Embedding> batchEmbedWithOptimization(List<String> texts, String preferredModel) {
        // Embed the texts sequentially and in input order, so results.get(i) always
        // corresponds to texts.get(i). Grouping by model first would reorder the results
        // whenever the texts map to different models.
        List<Embedding> results = new ArrayList<>(texts.size());

        for (String text : texts) {
            String selectedModel = preferredModel != null ? preferredModel :
                    selectOptimalModel(text, null, null);
            results.add(embedWithOptimization(text, selectedModel));
        }

        return results;
    }

    /**
     * 模型性能基准测试
     */
    public Map<String, BenchmarkResult> runBenchmark(List<String> testTexts) {
        Map<String, BenchmarkResult> results = new HashMap<>();

        for (String modelName : availableModels.keySet()) {
            System.out.println("测试模型: " + modelName);

            long totalTime = 0;
            int successCount = 0;
            List<Double> similarities = new ArrayList<>();

            EmbeddingModel model = availableModels.get(modelName);
            Embedding firstEmbedding = null; // embedding of testTexts.get(0), computed once and reused

            for (int i = 0; i < testTexts.size(); i++) {
                try {
                    long startTime = System.currentTimeMillis();
                    Embedding embedding = model.embed(testTexts.get(i)).content();
                    long duration = System.currentTimeMillis() - startTime;

                    totalTime += duration;
                    successCount++;

                    // Compare with the first text (a rough sanity check, not a quality metric)
                    if (i == 0) {
                        firstEmbedding = embedding;
                    } else if (firstEmbedding != null) {
                        double similarity = cosineSimilarity(embedding.vector(), firstEmbedding.vector());
                        similarities.add(similarity);
                    }

                } catch (Exception e) {
                    System.err.println("模型 " + modelName + " 处理失败: " + e.getMessage());
                }
            }

            double avgTime = successCount > 0 ? (double) totalTime / successCount : 0;
            double avgSimilarity = similarities.stream().mapToDouble(Double::doubleValue).average().orElse(0);

            results.put(modelName, new BenchmarkResult(
                    avgTime, successCount, testTexts.size(), avgSimilarity
            ));
        }

        return results;
    }

    /**
     * 嵌入质量评估
     */
    public double evaluateEmbeddingQuality(String text1, String text2, String expectedSimilarity, String modelName) {
        EmbeddingModel model = availableModels.get(modelName);

        Embedding embedding1 = model.embed(text1).content();
        Embedding embedding2 = model.embed(text2).content();

        double actualSimilarity = cosineSimilarity(embedding1.vector(), embedding2.vector());
        double expected = "high".equals(expectedSimilarity) ? 0.8 :
                "medium".equals(expectedSimilarity) ? 0.5 : 0.2;

        // 计算准确性分数
        return 1.0 - Math.abs(actualSimilarity - expected);
    }

    // 辅助方法
    private boolean containsChinese(String text) {
        return text.chars().anyMatch(ch -> ch >= 0x4e00 && ch <= 0x9fff);
    }

    private String getBestPerformingModel() {
        return performanceStats.entrySet().stream()
                .min(Comparator.comparingDouble(entry -> entry.getValue().getAverageTime()))
                .map(Map.Entry::getKey)
                .orElse("all-minilm-l6-v2");
    }

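    private String generateCacheKey(String text, String model) {
        // Use the full text rather than text.hashCode(): two different texts can share a hash
        // code, and a collision would silently return the wrong embedding.
        return model + ":" + text;
    }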

    private void updatePerformanceStats(String modelName, long duration, boolean success) {
        ModelPerformance stats = performanceStats.get(modelName);
        if (stats != null) {
            stats.recordExecution(duration, success);
        }
    }

    private double cosineSimilarity(float[] vectorA, float[] vectorB) {
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;

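        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }

        // Guard against zero vectors, which would otherwise produce NaN
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }

        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));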
    }

    private void startPerformanceMonitoring() {
        scheduler.scheduleAtFixedRate(() -> {
            // 清理过期缓存
            embeddingCache.entrySet().removeIf(entry -> entry.getValue().isExpired());

            // 输出性能报告
            printPerformanceReport();
        }, 5, 5, TimeUnit.MINUTES);
    }

    private void printPerformanceReport() {
        System.out.println("\n=== 嵌入模型性能报告 ===");
        performanceStats.forEach((name, stats) -> {
            System.out.printf("%s: 平均时间=%.2fms, 成功率=%.1f%%, 总调用=%d%n",
                    name, stats.getAverageTime(), stats.getSuccessRate() * 100, stats.getTotalCalls());
        });
        System.out.println("缓存大小: " + embeddingCache.size());
    }

    /**
     * 演示嵌入模型优化功能
     */
    public void demonstrateOptimization() {
        List<String> testTexts = Arrays.asList(
                "Java 是一种强类型的面向对象编程语言",
                "Python 具有简洁易读的语法特性",
                "算法导航提供可视化的算法学习体验",
                "面试鸭帮助程序员准备技术面试",
                "Machine learning algorithms for natural language processing"
        );

        System.out.println("演示智能模型选择:");
        for (String text : testTexts) {
            String selectedModel = selectOptimalModel(text, null, null);
            System.out.println("文本: " + text.substring(0, Math.min(20, text.length())) + "...");
            System.out.println("选择模型: " + selectedModel);

            // 执行嵌入
            Embedding embedding = embedWithOptimization(text, null);
            System.out.println("向量维度: " + embedding.dimension());
            System.out.println();
        }

        // 运行基准测试
        System.out.println("运行基准测试...");
        Map<String, BenchmarkResult> benchmarkResults = runBenchmark(testTexts);

        System.out.println("\n=== 基准测试结果 ===");
        benchmarkResults.forEach((modelName, result) -> {
            System.out.printf("%s: 平均时间=%.2fms, 成功率=%.1f%%, 平均相似度=%.3f%n",
                    modelName, result.averageTime, result.successRate * 100, result.averageSimilarity);
        });

        // 质量评估演示
        System.out.println("\n质量评估演示:");
        double quality1 = evaluateEmbeddingQuality(
                "Java 编程语言", "Java 开发", "high", "all-minilm-l6-v2");
        double quality2 = evaluateEmbeddingQuality(
                "算法学习", "数据结构", "medium", "bge-small-zh");

        System.out.printf("相似文本质量分数: %.3f%n", quality1);
        System.out.printf("中等相关文本质量分数: %.3f%n", quality2);
    }

    // 内部类定义
    private static class ModelPerformance {
        private final String modelName;
        private long totalTime;
        private int totalCalls;
        private int successCalls;

        public ModelPerformance(String modelName) {
            this.modelName = modelName;
        }

        public synchronized void recordExecution(long duration, boolean success) {
            totalTime += duration;
            totalCalls++;
            if (success) successCalls++;
        }

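        // Getters are synchronized too, so readers see the values written by recordExecution
        public synchronized double getAverageTime() {
            return totalCalls > 0 ? (double) totalTime / totalCalls : 0;
        }

        public synchronized double getSuccessRate() {
            return totalCalls > 0 ? (double) successCalls / totalCalls : 0;
        }

        public synchronized int getTotalCalls() {
            return totalCalls;
        }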
    }

    private static class CachedEmbedding {
        final Embedding embedding;
        final long timestamp;
        static final long TTL = 30 * 60 * 1000L; // entries expire after 30 minutes; shared by all instances

        public CachedEmbedding(Embedding embedding, long timestamp) {
            this.embedding = embedding;
            this.timestamp = timestamp;
        }

        public boolean isExpired() {
            return System.currentTimeMillis() - timestamp > TTL;
        }
    }

    public static class BenchmarkResult {
        final double averageTime;
        final int successCount;
        final int totalCount;
        final double averageSimilarity;
        final double successRate;

        public BenchmarkResult(double averageTime, int successCount, int totalCount, double averageSimilarity) {
            this.averageTime = averageTime;
            this.successCount = successCount;
            this.totalCount = totalCount;
            this.averageSimilarity = averageSimilarity;
            this.successRate = totalCount > 0 ? (double) successCount / totalCount : 0.0;
        }
    }
}

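Sample output from this program (timings and similarity scores are illustrative and will vary by machine; the performance report at the end is printed by the scheduled task, which first runs five minutes after startup):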

plain
演示智能模型选择:
文本: Java 是一种强类型的面向对象编程...
选择模型: bge-small-zh
向量维度: 512

文本: Python 具有简洁易读的语法特...
选择模型: bge-small-zh  
向量维度: 512

文本: 算法导航提供可视化的算法学习...
选择模型: bge-small-zh
向量维度: 512

文本: 面试鸭帮助程序员准备技术面试...
选择模型: bge-small-zh
向量维度: 512

文本: Machine learning algori...
选择模型: all-minilm-l6-v2
向量维度: 384

运行基准测试...
测试模型: all-minilm-l6-v2
测试模型: bge-small-zh

=== 基准测试结果 ===
all-minilm-l6-v2: 平均时间=245.60ms, 成功率=100.0%, 平均相似度=0.456
bge-small-zh: 平均时间=312.40ms, 成功率=100.0%, 平均相似度=0.523

质量评估演示:
相似文本质量分数: 0.876
中等相关文本质量分数: 0.734

=== 嵌入模型性能报告 ===
all-minilm-l6-v2: 平均时间=245.60ms, 成功率=100.0%, 总调用=1
bge-small-zh: 平均时间=312.40ms, 成功率=100.0%, 总调用=4
缓存大小: 5

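Optimizing embedding models is an ongoing process that has to be adjusted continually based on the actual use case and user feedback. With performance monitoring, quality evaluation, and smart model selection, we can build an efficient, accurate vector search system that gives users a high-quality semantic search experience.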
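One optimization the service above leaves open is bounding the cache: its `ConcurrentHashMap` only drops entries when they expire, so it can grow without limit between cleanups. The sketch below shows one possible approach, a size-bounded LRU cache with a TTL built on `LinkedHashMap`. The class name `BoundedEmbeddingCache`, its methods, and the choice to pass the current time in explicitly (to keep it easy to test) are assumptions made for this illustration, not part of LangChain4j or of the original service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedEmbeddingCache {

    // Cached vector plus its creation time, used for the TTL check.
    private static final class CachedVector {
        final float[] vector;
        final long createdAtMs;

        CachedVector(float[] vector, long createdAtMs) {
            this.vector = vector;
            this.createdAtMs = createdAtMs;
        }
    }

    private final int maxEntries;
    private final long ttlMs;
    private final LinkedHashMap<String, CachedVector> entries;

    public BoundedEmbeddingCache(int maxEntries, long ttlMs) {
        this.maxEntries = maxEntries;
        this.ttlMs = ttlMs;
        // accessOrder = true keeps entries in least-recently-used order;
        // removeEldestEntry evicts the LRU entry once the size bound is exceeded.
        this.entries = new LinkedHashMap<String, CachedVector>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, CachedVector> eldest) {
                return size() > BoundedEmbeddingCache.this.maxEntries;
            }
        };
    }

    // Returns the cached vector, or null if it is missing or older than the TTL.
    public synchronized float[] get(String key, long nowMs) {
        CachedVector cached = entries.get(key);
        if (cached == null) {
            return null;
        }
        if (nowMs - cached.createdAtMs > ttlMs) {
            entries.remove(key);
            return null;
        }
        return cached.vector;
    }

    public synchronized void put(String key, float[] vector, long nowMs) {
        entries.put(key, new CachedVector(vector, nowMs));
    }

    public synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        BoundedEmbeddingCache cache = new BoundedEmbeddingCache(2, 1000);
        cache.put("a", new float[]{1f}, 0);
        cache.put("b", new float[]{2f}, 0);
        cache.get("a", 1);                   // touch "a" so that "b" becomes least recently used
        cache.put("c", new float[]{3f}, 2);  // exceeds the bound and evicts "b"
        System.out.println("b cached: " + (cache.get("b", 3) != null));
        System.out.println("size: " + cache.size());
    }
}
```

All methods are synchronized because a `LinkedHashMap` in access order is modified even by `get`. Under heavy concurrency a dedicated caching library would likely be a better choice than this sketch.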


Exercises

Exercise 1

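Implement a vector similarity calculator that supports several measures (cosine similarity, Euclidean distance, Manhattan distance), and compare how they behave in different scenarios.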

java
package com.yupi.vectordb.util;

import java.util.Arrays;
import java.util.List;

/**
 * 编程导航 vector similarity calculator.
 * Supports several similarity measures and a performance comparison between them.
 */
public class VectorSimilarityCalculator {
    
    /**
     * 余弦相似度计算
     * 计算两个向量之间的夹角余弦值
     */
    public static double cosineSimilarity(float[] vectorA, float[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException("向量维度必须相同");
        }
        
        double dotProduct = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        
        for (int i = 0; i < vectorA.length; i++) {
            dotProduct += vectorA[i] * vectorB[i];
            normA += vectorA[i] * vectorA[i];
            normB += vectorB[i] * vectorB[i];
        }
        
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        
        return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
    }
    
    /**
     * 欧几里得距离计算
     * 计算两个向量在空间中的直线距离
     */
    public static double euclideanDistance(float[] vectorA, float[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException("向量维度必须相同");
        }
        
        double sumSquaredDiff = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            double diff = vectorA[i] - vectorB[i];
            sumSquaredDiff += diff * diff;
        }
        
        return Math.sqrt(sumSquaredDiff);
    }
    
    /**
     * 曼哈顿距离计算
     * 计算两个向量各维度差值的绝对值之和
     */
    public static double manhattanDistance(float[] vectorA, float[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException("向量维度必须相同");
        }
        
        double sumAbsDiff = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            sumAbsDiff += Math.abs(vectorA[i] - vectorB[i]);
        }
        
        return sumAbsDiff;
    }
    
    /**
     * 点积计算
     */
    public static double dotProduct(float[] vectorA, float[] vectorB) {
        if (vectorA.length != vectorB.length) {
            throw new IllegalArgumentException("向量维度必须相同");
        }
        
        double product = 0.0;
        for (int i = 0; i < vectorA.length; i++) {
            product += vectorA[i] * vectorB[i];
        }
        
        return product;
    }
    
    /**
     * 综合相似度评估
     * 结合多种方法给出综合评分
     */
    public static SimilarityResult comprehensiveAnalysis(float[] vectorA, float[] vectorB) {
        double cosine = cosineSimilarity(vectorA, vectorB);
        double euclidean = euclideanDistance(vectorA, vectorB);
        double manhattan = manhattanDistance(vectorA, vectorB);
        double dot = dotProduct(vectorA, vectorB);
        
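        // Normalize the distance measures into similarities.
        // Assuming every component lies in [-1, 1], the largest per-dimension gap is 2,
        // so the maximum Euclidean distance is sqrt(n * 2^2) = 2 * sqrt(n).
        double maxEuclidean = 2.0 * Math.sqrt(vectorA.length);
        double euclideanSimilarity = 1.0 - (euclidean / maxEuclidean);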
        
        double maxManhattan = vectorA.length * 2; // 假设向量值在[-1,1]范围
        double manhattanSimilarity = 1.0 - (manhattan / maxManhattan);
        
        // 综合评分(可根据应用场景调整权重)
        double comprehensiveScore = 0.4 * cosine + 0.3 * euclideanSimilarity + 0.3 * manhattanSimilarity;
        
        return new SimilarityResult(cosine, euclidean, manhattan, dot, comprehensiveScore);
    }
    
    /**
     * 批量相似度计算
     * 找出与查询向量最相似的Top-K个向量
     */
    public static List<SimilarityMatch> findTopSimilar(float[] queryVector, 
                                                      List<float[]> candidateVectors, 
                                                      int topK, 
                                                      SimilarityMethod method) {
        return candidateVectors.stream()
                .map(candidate -> {
                    double similarity = calculateSimilarity(queryVector, candidate, method);
                    return new SimilarityMatch(candidate, similarity);
                })
                .sorted((a, b) -> Double.compare(b.similarity, a.similarity))
                .limit(topK)
                .toList();
    }
    
    /**
     * 根据指定方法计算相似度
     */
    private static double calculateSimilarity(float[] vectorA, float[] vectorB, SimilarityMethod method) {
        return switch (method) {
            case COSINE -> cosineSimilarity(vectorA, vectorB);
            case EUCLIDEAN -> 1.0 / (1.0 + euclideanDistance(vectorA, vectorB)); // 转换为相似度
            case MANHATTAN -> 1.0 / (1.0 + manhattanDistance(vectorA, vectorB)); // 转换为相似度
            case DOT_PRODUCT -> dotProduct(vectorA, vectorB);
        };
    }
    
    /**
     * 性能基准测试
     */
    public static void benchmarkMethods(List<float[]> testVectors) {
        if (testVectors.size() < 2) return;
        
        float[] baseVector = testVectors.get(0);
        int iterations = 1000;
        
        System.out.println("=== 相似度计算方法性能测试 ===");
        
        for (SimilarityMethod method : SimilarityMethod.values()) {
            long startTime = System.nanoTime();
            double sink = 0.0; // accumulate results so the JIT cannot discard the calls as dead code
            
            for (int i = 0; i < iterations; i++) {
                for (int j = 1; j < testVectors.size(); j++) {
                    sink += calculateSimilarity(baseVector, testVectors.get(j), method);
                }
            }
            
            long endTime = System.nanoTime();
            double avgTime = (endTime - startTime) / (double)(iterations * (testVectors.size() - 1)) / 1_000_000;
            
            System.out.printf("%s: 平均耗时 %.4f ms (checksum %.3f)%n", method, avgTime, sink);
        }
    }
    
    /**
     * 演示不同相似度方法的效果
     */
    public static void demonstrateSimilarityMethods() {
        // 创建测试向量
        float[] vector1 = {1.0f, 0.0f, 0.0f, 1.0f};  // 基准向量
        float[] vector2 = {0.9f, 0.1f, 0.0f, 0.9f};  // 相似向量
        float[] vector3 = {0.0f, 1.0f, 1.0f, 0.0f};  // 不同向量
        float[] vector4 = {-1.0f, 0.0f, 0.0f, -1.0f}; // 相反向量
        
        System.out.println("=== 向量相似度计算演示 ===");
        System.out.println("基准向量: " + Arrays.toString(vector1));
        
        float[][] testVectors = {vector2, vector3, vector4};
        String[] descriptions = {"相似向量", "不同向量", "相反向量"};
        
        for (int i = 0; i < testVectors.length; i++) {
            System.out.println("\n" + descriptions[i] + ": " + Arrays.toString(testVectors[i]));
            
            SimilarityResult result = comprehensiveAnalysis(vector1, testVectors[i]);
            System.out.printf("余弦相似度: %.4f%n", result.cosineSimilarity);
            System.out.printf("欧几里得距离: %.4f%n", result.euclideanDistance);
            System.out.printf("曼哈顿距离: %.4f%n", result.manhattanDistance);
            System.out.printf("点积: %.4f%n", result.dotProduct);
            System.out.printf("综合评分: %.4f%n", result.comprehensiveScore);
        }
        
        // 批量相似度测试
        List<float[]> candidates = Arrays.asList(vector2, vector3, vector4);
        
        System.out.println("\n=== Top-2 最相似向量 (余弦相似度) ===");
        List<SimilarityMatch> topSimilar = findTopSimilar(vector1, candidates, 2, SimilarityMethod.COSINE);
        for (int i = 0; i < topSimilar.size(); i++) {
            SimilarityMatch match = topSimilar.get(i);
            System.out.printf("%d. %s, 相似度: %.4f%n", 
                i + 1, Arrays.toString(match.vector), match.similarity);
        }
        
        // 性能测试
        List<float[]> perfTestVectors = Arrays.asList(vector1, vector2, vector3, vector4);
        benchmarkMethods(perfTestVectors);
    }
    
    // 枚举和数据类
    public enum SimilarityMethod {
        COSINE, EUCLIDEAN, MANHATTAN, DOT_PRODUCT
    }
    
    public static class SimilarityResult {
        public final double cosineSimilarity;
        public final double euclideanDistance;
        public final double manhattanDistance;
        public final double dotProduct;
        public final double comprehensiveScore;
        
        public SimilarityResult(double cosine, double euclidean, double manhattan, 
                              double dot, double comprehensive) {
            this.cosineSimilarity = cosine;
            this.euclideanDistance = euclidean;
            this.manhattanDistance = manhattan;
            this.dotProduct = dot;
            this.comprehensiveScore = comprehensive;
        }
    }
    
    public static class SimilarityMatch {
        public final float[] vector;
        public final double similarity;
        
        public SimilarityMatch(float[] vector, double similarity) {
            this.vector = vector;
            this.similarity = similarity;
        }
    }
}
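
When comparing the measures, it helps to know that for unit-length vectors the squared Euclidean distance and the cosine similarity are tied together by d² = 2 − 2·cos, so both rank candidates in the same order; they only disagree when vector lengths differ. The sketch below checks this numerically. It is a separate illustration rather than part of the solution above: the class name `MetricComparisonSketch` and the sample vectors are made up for the example, it uses `double[]` instead of `float[]` to keep rounding noise small, and it assumes non-zero vectors.

```java
public class MetricComparisonSketch {

    public static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Cosine similarity; assumes neither vector is all zeros.
    public static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Scale a non-zero vector to unit length.
    public static double[] normalize(double[] a) {
        double norm = Math.sqrt(dot(a, a));
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = {3.0, 4.0};
        double[] b = {1.0, 0.0};

        // On unit vectors, squared Euclidean distance equals 2 - 2 * cosine.
        double[] ua = normalize(a);
        double[] ub = normalize(b);
        double cos = cosine(ua, ub);
        double dist = euclidean(ua, ub);
        System.out.println("cos = " + cos);
        System.out.println("dist^2 = " + dist * dist + ", 2 - 2cos = " + (2 - 2 * cos));

        // Scaling a vector leaves the cosine unchanged but changes the Euclidean distance.
        double[] scaled = {30.0, 40.0};
        System.out.println("cosine(a, b) = " + cosine(a, b) + ", cosine(10a, b) = " + cosine(scaled, b));
        System.out.println("euclidean(a, b) = " + euclidean(a, b) + ", euclidean(10a, b) = " + euclidean(scaled, b));
    }
}
```

In practice this means that if your embedding model already returns normalized vectors, the choice between cosine and Euclidean distance mostly affects the score scale, not the ranking.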

Exercise 2

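Design a performance monitoring system for a vector database that tracks key metrics in real time, such as query response time, cache hit rate, and vector store size, and raises alerts.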
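Before looking at a full solution, note that query response time is often monitored as a tail percentile (for example p95) rather than only as an average, because a few slow queries can hide behind a healthy mean. The sketch below keeps a fixed-size window of recent latencies and computes a nearest-rank percentile. It is a standalone illustration: the class name `LatencyWindowSketch`, the window size, and the sample latencies are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LatencyWindowSketch {

    private final int capacity;
    private final Deque<Long> samples = new ArrayDeque<>();

    public LatencyWindowSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Keep only the most recent `capacity` samples, dropping the oldest first.
    public synchronized void record(long latencyMs) {
        if (samples.size() == capacity) {
            samples.pollFirst();
        }
        samples.addLast(latencyMs);
    }

    public synchronized double mean() {
        return samples.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    // Nearest-rank percentile: the smallest sample such that at least p percent
    // of the samples in the window are less than or equal to it.
    public synchronized long percentile(double p) {
        if (samples.isEmpty()) {
            return 0L;
        }
        long[] sorted = samples.stream().mapToLong(Long::longValue).sorted().toArray();
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        int index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
        return sorted[index];
    }

    public static void main(String[] args) {
        LatencyWindowSketch window = new LatencyWindowSketch(100);
        // 90 fast queries and 10 slow ones: the mean looks moderate, the p95 does not.
        for (int i = 0; i < 90; i++) {
            window.record(10);
        }
        for (int i = 0; i < 10; i++) {
            window.record(1000);
        }
        System.out.println("mean = " + window.mean() + " ms");
        System.out.println("p50  = " + window.percentile(50) + " ms");
        System.out.println("p95  = " + window.percentile(95) + " ms");
    }
}
```

Sorting the window on every call is fine for a small window; for high query volumes a streaming quantile estimator would be a more reasonable design.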

java
package com.yupi.vectordb.monitor;

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/**
 * 面试鸭 vector database performance monitoring system.
 * Provides real-time performance monitoring, metric statistics, and alerting.
 */
public class VectorDBPerformanceMonitor {
    
    private final ScheduledExecutorService scheduler;
    private final Map<String, AtomicLong> counters;
    private final List<QueryRecord> queryHistory;
    private final Map<String, AlertRule> alertRules;
    private final Queue<AlertEvent> activeAlerts;
    
    // 性能指标
    private volatile long totalQueries = 0;
    private volatile long cacheHits = 0;
    private volatile long cacheMisses = 0;
    private volatile long totalResponseTime = 0;
    private volatile long vectorStoreSize = 0;
    private volatile double memoryUsage = 0.0;
    
    // 监控配置
    private final int maxHistorySize = 10000;
    private final int maxActiveAlerts = 100;
    
    public VectorDBPerformanceMonitor() {
        this.scheduler = Executors.newScheduledThreadPool(2);
        this.counters = new ConcurrentHashMap<>();
        this.queryHistory = Collections.synchronizedList(new ArrayList<>());
        this.alertRules = new ConcurrentHashMap<>();
        this.activeAlerts = new LinkedList<>();
        
        initializeDefaultMetrics();
        startMonitoring();
    }
    
    /**
     * 初始化默认监控指标
     */
    private void initializeDefaultMetrics() {
        counters.put("query_count", new AtomicLong(0));
        counters.put("error_count", new AtomicLong(0));
        counters.put("cache_hits", new AtomicLong(0));
        counters.put("cache_misses", new AtomicLong(0));
        counters.put("embeddings_generated", new AtomicLong(0));
        
        // 设置默认告警规则
        addAlertRule("high_response_time", new AlertRule(
            "平均响应时间过高", 1000.0, AlertRule.Operator.GREATER_THAN, 5));
        addAlertRule("low_cache_hit_rate", new AlertRule(
            "缓存命中率过低", 0.5, AlertRule.Operator.LESS_THAN, 3));
        addAlertRule("high_error_rate", new AlertRule(
            "错误率过高", 0.05, AlertRule.Operator.GREATER_THAN, 2));
    }
    
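    /**
     * Records the outcome of one query.
     * Synchronized because the volatile counters are updated with ++ and +=,
     * which are not atomic even on volatile fields.
     */
    public synchronized void recordQuery(String operation, long responseTimeMs, boolean success, boolean cacheHit) {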
        QueryRecord record = new QueryRecord(
            System.currentTimeMillis(), operation, responseTimeMs, success, cacheHit);
        
        // 更新统计数据
        counters.get("query_count").incrementAndGet();
        totalQueries++;
        totalResponseTime += responseTimeMs;
        
        if (cacheHit) {
            counters.get("cache_hits").incrementAndGet();
            cacheHits++;
        } else {
            counters.get("cache_misses").incrementAndGet();
            cacheMisses++;
        }
        
        if (!success) {
            counters.get("error_count").incrementAndGet();
        }
        
        // 添加到历史记录
        synchronized (queryHistory) {
            queryHistory.add(record);
            if (queryHistory.size() > maxHistorySize) {
                queryHistory.remove(0);
            }
        }
        
        // 检查告警条件
        checkAlertConditions();
    }
    
    /**
     * 记录向量生成
     */
    public void recordEmbeddingGeneration(int vectorCount, long processingTimeMs) {
        counters.get("embeddings_generated").addAndGet(vectorCount);
        
        QueryRecord record = new QueryRecord(
            System.currentTimeMillis(), "embedding_generation", processingTimeMs, true, false);
        
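        synchronized (queryHistory) {
            queryHistory.add(record);
            // Apply the same size cap as recordQuery so the history cannot grow without bound
            if (queryHistory.size() > maxHistorySize) {
                queryHistory.remove(0);
            }
        }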
    }
    
    /**
     * 更新向量存储大小
     */
    public void updateVectorStoreSize(long newSize) {
        this.vectorStoreSize = newSize;
    }
    
    /**
     * 更新内存使用情况
     */
    public void updateMemoryUsage(double memoryUsagePercent) {
        this.memoryUsage = memoryUsagePercent;
    }
    
    /**
     * 获取实时性能指标
     */
    public PerformanceMetrics getCurrentMetrics() {
        long totalQueries = this.totalQueries;
        double avgResponseTime = totalQueries > 0 ? (double) totalResponseTime / totalQueries : 0;
        double cacheHitRate = (cacheHits + cacheMisses) > 0 ? 
                             (double) cacheHits / (cacheHits + cacheMisses) : 0;
        double errorRate = totalQueries > 0 ? 
                          (double) counters.get("error_count").get() / totalQueries : 0;
        
        return new PerformanceMetrics(
            totalQueries, avgResponseTime, cacheHitRate, errorRate,
            vectorStoreSize, memoryUsage, counters.get("embeddings_generated").get()
        );
    }
    
    /**
     * 获取指定时间窗口的性能统计
     */
    public PerformanceMetrics getMetricsInTimeWindow(long windowMinutes) {
        long cutoffTime = System.currentTimeMillis() - (windowMinutes * 60 * 1000);
        
        List<QueryRecord> windowRecords;
        synchronized (queryHistory) {
            windowRecords = queryHistory.stream()
                    .filter(record -> record.timestamp >= cutoffTime)
                    .toList();
        }
        
        if (windowRecords.isEmpty()) {
            return new PerformanceMetrics(0, 0, 0, 0, vectorStoreSize, memoryUsage, 0);
        }
        
        long totalQueries = windowRecords.size();
        double avgResponseTime = windowRecords.stream()
                .mapToLong(r -> r.responseTimeMs)
                .average().orElse(0);
        
        long cacheHits = windowRecords.stream()
                .mapToLong(r -> r.cacheHit ? 1 : 0).sum();
        double cacheHitRate = (double) cacheHits / totalQueries;
        
        long errors = windowRecords.stream()
                .mapToLong(r -> r.success ? 0 : 1).sum();
        double errorRate = (double) errors / totalQueries;
        
        return new PerformanceMetrics(
            totalQueries, avgResponseTime, cacheHitRate, errorRate,
            vectorStoreSize, memoryUsage, 0
        );
    }
    
    /**
     * Adds an alert rule
     */
    public void addAlertRule(String ruleName, AlertRule rule) {
        alertRules.put(ruleName, rule);
    }
    
    /**
     * Checks all alert conditions against the current metrics
     */
    private void checkAlertConditions() {
        PerformanceMetrics current = getCurrentMetrics();
        
        alertRules.forEach((name, rule) -> {
            double value = getMetricValue(current, rule.metricName);
            boolean triggered = rule.evaluate(value);
            
            if (triggered) {
                AlertEvent alert = new AlertEvent(
                    name, rule.description, value, System.currentTimeMillis());
                
                synchronized (activeAlerts) {
                    activeAlerts.offer(alert);
                    if (activeAlerts.size() > maxActiveAlerts) {
                        activeAlerts.poll();
                    }
                }
                
                System.err.printf("[ALERT] %s: %s (value: %.4f)%n",
                    name, rule.description, value);
            }
        });
    }
    
    /**
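     * Looks up a metric value by name. AlertRule uses its description as the
     * metric name (simplified implementation), so the case labels below must
     * match the rule descriptions exactly; unknown names fall back to 0.0.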
     */
    private double getMetricValue(PerformanceMetrics metrics, String metricName) {
        return switch (metricName) {
            case "平均响应时间过高" -> metrics.averageResponseTime;
            case "缓存命中率过低" -> metrics.cacheHitRate;
            case "错误率过高" -> metrics.errorRate;
            case "内存使用率" -> metrics.memoryUsage;
            default -> 0.0;
        };
    }
    
    /**
     * Starts the background monitoring tasks
     */
    private void startMonitoring() {
        // Print a performance report periodically
        scheduler.scheduleAtFixedRate(this::printPerformanceReport, 1, 5, TimeUnit.MINUTES);
        
        // Clean up expired alerts periodically
        scheduler.scheduleAtFixedRate(this::cleanupExpiredAlerts, 1, 1, TimeUnit.HOURS);
    }
    
    /**
     * Prints the performance report
     */
    public void printPerformanceReport() {
        PerformanceMetrics current = getCurrentMetrics();
        PerformanceMetrics last5Min = getMetricsInTimeWindow(5);
        
        String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
        
        System.out.println("\n" + "=".repeat(50));
        System.out.println("代码小抄 vector database performance report - " + timestamp);
        System.out.println("=".repeat(50));
        
        System.out.println("[Overall]");
        System.out.printf("  Total queries: %d%n", current.totalQueries);
        System.out.printf("  Average response time: %.2f ms%n", current.averageResponseTime);
        System.out.printf("  Cache hit rate: %.2f%%%n", current.cacheHitRate * 100);
        System.out.printf("  Error rate: %.2f%%%n", current.errorRate * 100);
        System.out.printf("  Vector store size: %d%n", current.vectorStoreSize);
        System.out.printf("  Memory usage: %.2f%%%n", current.memoryUsage);
        System.out.printf("  Embeddings generated: %d%n", current.embeddingsGenerated);
        
        System.out.println("\n[Last 5 minutes]");
        System.out.printf("  Queries: %d%n", last5Min.totalQueries);
        System.out.printf("  Average response time: %.2f ms%n", last5Min.averageResponseTime);
        System.out.printf("  Cache hit rate: %.2f%%%n", last5Min.cacheHitRate * 100);
        System.out.printf("  Error rate: %.2f%%%n", last5Min.errorRate * 100);
        
        // Show active alerts
        synchronized (activeAlerts) {
            if (!activeAlerts.isEmpty()) {
                System.out.println("\n[Active alerts]");
                activeAlerts.forEach(alert -> 
                    System.out.printf("  - %s: %s (time: %s)%n", 
                        alert.ruleName, alert.description,
                        LocalDateTime.ofInstant(
                            java.time.Instant.ofEpochMilli(alert.timestamp),
                            java.time.ZoneId.systemDefault()
                        ).format(DateTimeFormatter.ofPattern("HH:mm:ss"))
                    )
                );
            }
        }
        
        System.out.println("=".repeat(50));
    }
    
    /**
     * Removes expired alerts
     */
    private void cleanupExpiredAlerts() {
        long expireTime = System.currentTimeMillis() - (24 * 60 * 60 * 1000); // alerts expire after 24 hours
        
        synchronized (activeAlerts) {
            activeAlerts.removeIf(alert -> alert.timestamp < expireTime);
        }
    }
    
    /**
     * Returns the current system health status
     */
    public SystemHealth getSystemHealth() {
        PerformanceMetrics metrics = getCurrentMetrics();
        
        // Evaluate overall health from the key metrics
        boolean healthy = true;
        List<String> issues = new ArrayList<>();
        
        if (metrics.averageResponseTime > 2000) {
            healthy = false;
            issues.add("response time too long");
        }
        
        if (metrics.cacheHitRate < 0.3) {
            healthy = false;
            issues.add("cache hit rate too low");
        }
        
        if (metrics.errorRate > 0.1) {
            healthy = false;
            issues.add("error rate too high");
        }
        
        if (metrics.memoryUsage > 90) {
            healthy = false;
            issues.add("memory usage too high");
        }
        
        return new SystemHealth(healthy, issues, metrics);
    }
    
    /**
     * Demonstrates the monitoring system
     */
    public void demonstrateMonitoring() {
        System.out.println("=== 剪切助手 vector database monitoring demo ===");
        
        // Simulate query operations
        System.out.println("Simulating queries...");
        for (int i = 0; i < 50; i++) {
            long responseTime = 100 + (long)(Math.random() * 500); // 100-600 ms
            boolean success = Math.random() > 0.05; // ~95% success rate
            boolean cacheHit = Math.random() > 0.4; // ~60% cache hit rate
            
            recordQuery("semantic_search", responseTime, success, cacheHit);
            
            if (i % 10 == 0) {
                recordEmbeddingGeneration(5, 200 + (long)(Math.random() * 300));
            }
        }
        
        // Update storage information
        updateVectorStoreSize(1000000);
        updateMemoryUsage(65.5);
        
        // Print the current metrics
        PerformanceMetrics current = getCurrentMetrics();
        System.out.println("\nCurrent performance metrics:");
        System.out.printf("Total queries: %d%n", current.totalQueries);
        System.out.printf("Average response time: %.2f ms%n", current.averageResponseTime);
        System.out.printf("Cache hit rate: %.2f%%%n", current.cacheHitRate * 100);
        System.out.printf("Error rate: %.2f%%%n", current.errorRate * 100);
        
        // Check system health
        SystemHealth health = getSystemHealth();
        System.out.println("\nSystem health: " + (health.healthy ? "healthy" : "unhealthy"));
        if (!health.issues.isEmpty()) {
            System.out.println("Issues found: " + String.join(", ", health.issues));
        }
        
        // Simulate slow queries to trigger an alert
        System.out.println("\nSimulating a high response time scenario...");
        for (int i = 0; i < 5; i++) {
            recordQuery("slow_query", 1500 + (long)(Math.random() * 1000), true, false);
        }
    }
    
    // Nested class definitions
    public static class QueryRecord {
        public final long timestamp;
        public final String operation;
        public final long responseTimeMs;
        public final boolean success;
        public final boolean cacheHit;
        
        public QueryRecord(long timestamp, String operation, long responseTimeMs, 
                          boolean success, boolean cacheHit) {
            this.timestamp = timestamp;
            this.operation = operation;
            this.responseTimeMs = responseTimeMs;
            this.success = success;
            this.cacheHit = cacheHit;
        }
    }
    
    public static class PerformanceMetrics {
        public final long totalQueries;
        public final double averageResponseTime;
        public final double cacheHitRate;
        public final double errorRate;
        public final long vectorStoreSize;
        public final double memoryUsage;
        public final long embeddingsGenerated;
        
        public PerformanceMetrics(long totalQueries, double averageResponseTime, 
                                double cacheHitRate, double errorRate, 
                                long vectorStoreSize, double memoryUsage, 
                                long embeddingsGenerated) {
            this.totalQueries = totalQueries;
            this.averageResponseTime = averageResponseTime;
            this.cacheHitRate = cacheHitRate;
            this.errorRate = errorRate;
            this.vectorStoreSize = vectorStoreSize;
            this.memoryUsage = memoryUsage;
            this.embeddingsGenerated = embeddingsGenerated;
        }
    }
    
    public static class AlertRule {
        public final String description;
        public final double threshold;
        public final Operator operator;
        public final int priority;
        public final String metricName;
        
        public AlertRule(String description, double threshold, Operator operator, int priority) {
            this.description = description;
            this.threshold = threshold;
            this.operator = operator;
            this.priority = priority;
            this.metricName = description; // simplified: the description doubles as the metric key
        }
        
        public boolean evaluate(double value) {
            return switch (operator) {
                case GREATER_THAN -> value > threshold;
                case LESS_THAN -> value < threshold;
                case EQUALS -> Math.abs(value - threshold) < 0.001;
            };
        }
        
        public enum Operator {
            GREATER_THAN, LESS_THAN, EQUALS
        }
    }
    
    public static class AlertEvent {
        public final String ruleName;
        public final String description;
        public final double value;
        public final long timestamp;
        
        public AlertEvent(String ruleName, String description, double value, long timestamp) {
            this.ruleName = ruleName;
            this.description = description;
            this.value = value;
            this.timestamp = timestamp;
        }
    }
    
    public static class SystemHealth {
        public final boolean healthy;
        public final List<String> issues;
        public final PerformanceMetrics metrics;
        
        public SystemHealth(boolean healthy, List<String> issues, PerformanceMetrics metrics) {
            this.healthy = healthy;
            this.issues = issues;
            this.metrics = metrics;
        }
    }
}
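
In the listing above, `AlertRule` reuses its description as the metric name, and `getMetricValue` returns `0.0` for any name it does not recognise, so a mistyped `LESS_THAN` rule with a positive threshold would fire on every check. The following is a minimal sketch of one way to separate the two concerns, not the author's implementation. The `Metric` enum, the `Rule` record, and `isTriggered` are hypothetical names introduced only for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

public class AlertRuleSketch {

    /** Hypothetical metric keys; these do not exist in the original listing. */
    enum Metric { AVG_RESPONSE_TIME_MS, CACHE_HIT_RATE, ERROR_RATE, MEMORY_USAGE_PERCENT }

    enum Operator { GREATER_THAN, LESS_THAN }

    /** A rule that names its metric explicitly instead of reusing the description. */
    record Rule(Metric metric, Operator operator, double threshold, String description) {
        boolean evaluate(double value) {
            return switch (operator) {
                case GREATER_THAN -> value > threshold;
                case LESS_THAN -> value < threshold;
            };
        }
    }

    /** Returns true when the rule fires for the given snapshot of metric values. */
    static boolean isTriggered(Rule rule, Map<Metric, Double> snapshot) {
        Double value = snapshot.get(rule.metric());
        // A metric missing from the snapshot never triggers, rather than defaulting to 0.0
        return value != null && rule.evaluate(value);
    }

    public static void main(String[] args) {
        Map<Metric, Double> snapshot = new EnumMap<>(Metric.class);
        snapshot.put(Metric.AVG_RESPONSE_TIME_MS, 1800.0);
        snapshot.put(Metric.CACHE_HIT_RATE, 0.62);

        Rule slow = new Rule(Metric.AVG_RESPONSE_TIME_MS, Operator.GREATER_THAN,
                1000, "average response time too high");
        Rule coldCache = new Rule(Metric.CACHE_HIT_RATE, Operator.LESS_THAN,
                0.3, "cache hit rate too low");

        System.out.println(slow.description() + ": " + isTriggered(slow, snapshot));
        System.out.println(coldCache.description() + ": " + isTriggered(coldCache, snapshot));
    }
}
```

With an enum key the compiler rejects a misspelled metric, and the description is free to change without breaking the lookup.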

Exercise 3

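Build an intelligent document chunking system that automatically selects the best chunking strategy based on document type, content characteristics, and intended use, and that supports advanced features such as overlapping chunks and semantic chunking.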
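
Overlapping chunks are the simplest of the features the exercise asks for, so it may help to see the idea in isolation first. The sketch below is a minimal illustration that assumes character-based windows and does no word-boundary handling; the class and method names are hypothetical and independent of the full solution.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapChunkSketch {

    /**
     * Splits text into windows of at most chunkSize characters, where each window
     * starts (chunkSize - overlap) characters after the previous one.
     */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // always >= 1 because overlap < chunkSize
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 10 characters, windows of 4 with 1 character shared between neighbours
        System.out.println(chunk("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

Rejecting `overlap >= chunkSize` up front matters because the step would otherwise be zero or negative and the loop would never advance.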

java
package com.yupi.vectordb.chunking;


import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/**
 * Intelligent document chunking system for 算法导航
 * Provides several chunking strategies and automatic tuning
 */
public class IntelligentDocumentChunker {

    private final EmbeddingModel embeddingModel;
    private final Map<DocumentType, ChunkingStrategy> defaultStrategies;
    private final List<ChunkingRule> customRules;

    // Chunking configuration
    private static final int DEFAULT_CHUNK_SIZE = 512;
    private static final int DEFAULT_OVERLAP = 50;
    private static final double SEMANTIC_SIMILARITY_THRESHOLD = 0.8;

    public IntelligentDocumentChunker() {
        this.embeddingModel = new BgeSmallZhEmbeddingModel();
        this.defaultStrategies = initializeDefaultStrategies();
        this.customRules = new ArrayList<>();
    }

    /**
     * Initializes the default chunking strategy for each document type
     */
    private Map<DocumentType, ChunkingStrategy> initializeDefaultStrategies() {
        Map<DocumentType, ChunkingStrategy> strategies = new HashMap<>();

        strategies.put(DocumentType.CODE, new ChunkingStrategy(
                ChunkingMethod.FUNCTION_BASED, 200, 20, "chunk by function and class"));
        strategies.put(DocumentType.TECHNICAL_DOC, new ChunkingStrategy(
                ChunkingMethod.SEMANTIC, 512, 50, "chunk by semantic relatedness"));
        strategies.put(DocumentType.TUTORIAL, new ChunkingStrategy(
                ChunkingMethod.HIERARCHICAL, 400, 40, "chunk by section hierarchy"));
        strategies.put(DocumentType.FAQ, new ChunkingStrategy(
                ChunkingMethod.QA_PAIR, 300, 0, "chunk by question-answer pair"));
        strategies.put(DocumentType.GENERAL, new ChunkingStrategy(
                ChunkingMethod.SLIDING_WINDOW, 512, 50, "sliding-window chunking"));

        return strategies;
    }

    /**
     * Main entry point for intelligent document chunking
     */
    public List<TextSegment> chunkDocument(Document document, ChunkingOptions options) {
        // Analyze the document's characteristics
        DocumentAnalysis analysis = analyzeDocument(document);

        // Select the best chunking strategy
        ChunkingStrategy strategy = selectOptimalStrategy(analysis, options);

        // Perform the chunking
        List<TextSegment> chunks = executeChunking(document, strategy, analysis);

        // Post-process the result
        chunks = postProcessChunks(chunks, options);

        return chunks;
    }

    /**
     * Analyzes the document's characteristics
     */
    private DocumentAnalysis analyzeDocument(Document document) {
        String text = document.text();

        // Basic statistics
        int totalLength = text.length();
        int lineCount = text.split("\n").length;
        int paragraphCount = text.split("\n\\s*\n").length;

        // Detect the document type
        DocumentType type = detectDocumentType(text);

        // Detect the language
        String language = detectLanguage(text);

        // Analyze structural features
        StructureFeatures structure = analyzeStructure(text);

        // Compute content complexity
        double complexity = calculateComplexity(text);

        return new DocumentAnalysis(type, language, totalLength, lineCount,
                paragraphCount, structure, complexity);
    }

    /**
     * Detects the document type (the first matching check wins)
     */
    private DocumentType detectDocumentType(String text) {
        // Source code
        if (containsCodePatterns(text)) {
            return DocumentType.CODE;
        }

        // FAQ
        if (containsQAPatterns(text)) {
            return DocumentType.FAQ;
        }

        // Tutorial
        if (containsTutorialPatterns(text)) {
            return DocumentType.TUTORIAL;
        }

        // Technical documentation
        if (containsTechnicalPatterns(text)) {
            return DocumentType.TECHNICAL_DOC;
        }

        return DocumentType.GENERAL;
    }

    /**
     * Selects the best chunking strategy
     */
    private ChunkingStrategy selectOptimalStrategy(DocumentAnalysis analysis, ChunkingOptions options) {
        // A strategy supplied by the caller takes precedence
        if (options.preferredStrategy != null) {
            return options.preferredStrategy;
        }

        // Otherwise start from the default strategy for the document type
        ChunkingStrategy baseStrategy = defaultStrategies.get(analysis.documentType);

        // Then adjust it to the document's characteristics
        return adjustStrategyByAnalysis(baseStrategy, analysis, options);
    }

    /**
     * Adjusts the strategy according to the analysis result
     */
    private ChunkingStrategy adjustStrategyByAnalysis(ChunkingStrategy baseStrategy,
                                                      DocumentAnalysis analysis,
                                                      ChunkingOptions options) {
        int adjustedChunkSize = baseStrategy.chunkSize;
        int adjustedOverlap = baseStrategy.overlap;

        // Adjust chunk size to the document length
        if (analysis.totalLength < 1000) {
            adjustedChunkSize = Math.min(adjustedChunkSize, 200);
        } else if (analysis.totalLength > 10000) {
            adjustedChunkSize = Math.max(adjustedChunkSize, 800);
        }

        // Use more overlap for complex content
        if (analysis.complexity > 0.7) {
            adjustedOverlap = (int)(adjustedOverlap * 1.5);
        }

        // Adjust to the intended use
        if (options.targetUse == TargetUse.QA_SYSTEM) {
            adjustedChunkSize = Math.min(adjustedChunkSize, 400);
            adjustedOverlap = Math.max(adjustedOverlap, 40);
        } else if (options.targetUse == TargetUse.SUMMARIZATION) {
            adjustedChunkSize = Math.max(adjustedChunkSize, 600);
        }

        return new ChunkingStrategy(baseStrategy.method, adjustedChunkSize,
                adjustedOverlap, baseStrategy.description);
    }

    /**
     * Dispatches to the chunking method selected by the strategy
     */
    private List<TextSegment> executeChunking(Document document, ChunkingStrategy strategy,
                                              DocumentAnalysis analysis) {
        return switch (strategy.method) {
            case SLIDING_WINDOW -> slidingWindowChunking(document, strategy);
            case SEMANTIC -> semanticChunking(document, strategy);
            case HIERARCHICAL -> hierarchicalChunking(document, strategy, analysis);
            case FUNCTION_BASED -> functionBasedChunking(document, strategy);
            case QA_PAIR -> qaPairChunking(document, strategy);
            case SENTENCE_BOUNDARY -> sentenceBoundaryChunking(document, strategy);
        };
    }

    /**
     * Sliding-window chunking
     */
    private List<TextSegment> slidingWindowChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();

        int chunkSize = Math.max(1, strategy.chunkSize);
        // Clamp the overlap so that every window advances by at least one character
        int overlap = Math.max(0, Math.min(strategy.overlap, chunkSize - 1));

        int i = 0;
        while (i < text.length()) {
            int end = Math.min(i + chunkSize, text.length());
            String chunkText = text.substring(i, end);

            // Prefer to cut at a word boundary (a space) in the second half of the window
            if (end < text.length()) {
                int lastSpace = chunkText.lastIndexOf(' ');
                if (lastSpace > chunkSize / 2) {
                    chunkText = chunkText.substring(0, lastSpace);
                    end = i + lastSpace;
                }
            }

            // TextSegment rejects blank text, so whitespace-only windows are skipped
            if (!chunkText.isBlank()) {
                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                metadataMap.put("startPos", String.valueOf(i));
                metadataMap.put("endPos", String.valueOf(end));
                metadataMap.put("method", "sliding_window");
                Metadata metadata = new Metadata(metadataMap);

                chunks.add(TextSegment.from(chunkText.trim(), metadata));
            }

            if (end >= text.length()) break;

            // Start the next window relative to where this chunk actually ended, so that
            // text cut off by the word-boundary adjustment is not skipped
            i = Math.max(end - overlap, i + 1);
        }

        return chunks;
    }

    /**
     * Semantic chunking
     */
    private List<TextSegment> semanticChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();

        // Split into sentences first
        List<String> sentences = splitIntoSentences(text);
        List<TextSegment> chunks = new ArrayList<>();

        StringBuilder currentChunk = new StringBuilder();
        List<String> currentSentences = new ArrayList<>();

        for (String sentence : sentences) {
            // Check whether this sentence should start a new chunk
            if (shouldStartNewSemanticChunk(currentSentences, sentence, strategy)) {
                if (currentChunk.length() > 0) {
                    Map<String, String> metadataMap = new HashMap<>();
                    metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                    metadataMap.put("sentenceCount", String.valueOf(currentSentences.size()));
                    metadataMap.put("method", "semantic");

                    Metadata metadata = new Metadata(metadataMap);
                    chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));

                    // Handle overlap: carry the last sentences (at most two) into the next chunk.
                    // They are copied before the buffers are reset; reading them after the reset
                    // would index into an empty list.
                    List<String> carried = new ArrayList<>();
                    if (strategy.overlap > 0 && !currentSentences.isEmpty()) {
                        int overlapSentences = Math.min(2, currentSentences.size());
                        carried.addAll(currentSentences.subList(
                                currentSentences.size() - overlapSentences, currentSentences.size()));
                    }

                    currentChunk = new StringBuilder();
                    currentSentences = new ArrayList<>();
                    for (String carriedSentence : carried) {
                        currentChunk.append(carriedSentence).append(" ");
                        currentSentences.add(carriedSentence);
                    }
                }
            }

            currentChunk.append(sentence).append(" ");
            currentSentences.add(sentence);
        }

        // Add the final chunk
        if (currentChunk.length() > 0) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("sentenceCount", String.valueOf(currentSentences.size()));
            metadataMap.put("method", "semantic");
            Metadata metadata = new Metadata(metadataMap);

            chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
        }

        return chunks;
    }

    /**
     * Hierarchical chunking
     */
    private List<TextSegment> hierarchicalChunking(Document document, ChunkingStrategy strategy,
                                                   DocumentAnalysis analysis) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();

        // Detect headings and the sections they introduce
        List<Section> sections = detectSections(text);

        for (Section section : sections) {
            // Skip whitespace-only sections: TextSegment and Document reject blank text
            if (section.content.isBlank()) {
                continue;
            }

            if (section.content.length() <= strategy.chunkSize) {
                // The whole section fits into a single chunk
                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                metadataMap.put("sectionTitle", section.title);
                metadataMap.put("sectionLevel", String.valueOf(section.level));
                metadataMap.put("method", "hierarchical");
                Metadata metadata = new Metadata(metadataMap);

                chunks.add(TextSegment.from(section.content.trim(), metadata));
            } else {
                // The section is too long and must be split further
                List<TextSegment> sectionChunks = slidingWindowChunking(
                        Document.from(section.content), strategy);

                for (int i = 0; i < sectionChunks.size(); i++) {
                    TextSegment chunk = sectionChunks.get(i);
                    chunk.metadata().put("sectionTitle", section.title);
                    chunk.metadata().put("sectionLevel", String.valueOf(section.level));
                    chunk.metadata().put("subChunkIndex", String.valueOf(i));
                    chunk.metadata().put("method", "hierarchical");
                    chunks.add(chunk);
                }
            }
        }

        return chunks;
    }

    /**
     * Function-based chunking (for code documents)
     */
    private List<TextSegment> functionBasedChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();

        // Use regular expressions to detect code structures such as classes and methods
        List<CodeBlock> codeBlocks = extractCodeBlocks(text);

        for (CodeBlock block : codeBlocks) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("blockType", block.type);
            metadataMap.put("blockName", block.name);
            metadataMap.put("method", "function_based");
            Metadata metadata = new Metadata(metadataMap);

            chunks.add(TextSegment.from(block.content.trim(), metadata));
        }

        return chunks;
    }

    /**
     * Question-answer pair chunking
     */
    private List<TextSegment> qaPairChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();

        // Detect question-answer pairs
        List<QAPair> qaPairs = extractQAPairs(text);

        for (QAPair pair : qaPairs) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("question", pair.question);
            metadataMap.put("method", "qa_pair");
            Metadata metadata = new Metadata(metadataMap);

            String chunkContent = pair.question + "\n" + pair.answer;
            chunks.add(TextSegment.from(chunkContent.trim(), metadata));
        }

        return chunks;
    }

    /**
     * Sentence-boundary chunking
     */
    private List<TextSegment> sentenceBoundaryChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<String> sentences = splitIntoSentences(text);
        List<TextSegment> chunks = new ArrayList<>();

        StringBuilder currentChunk = new StringBuilder();
        int sentenceCount = 0;

        for (String sentence : sentences) {
            if (currentChunk.length() + sentence.length() > strategy.chunkSize &&
                    currentChunk.length() > 0) {

                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                metadataMap.put("sentenceCount", String.valueOf(sentenceCount));
                metadataMap.put("method", "sentence_boundary");
                Metadata metadata = new Metadata(metadataMap);

                chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));

                currentChunk = new StringBuilder();
                sentenceCount = 0;
            }

            currentChunk.append(sentence).append(" ");
            sentenceCount++;
        }

        if (currentChunk.length() > 0) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("sentenceCount", String.valueOf(sentenceCount));
            metadataMap.put("method", "sentence_boundary");
            Metadata metadata = new Metadata(metadataMap);

            chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
        }

        return chunks;
    }

    /**
     * Post-processing of chunks
     */
    private List<TextSegment> postProcessChunks(List<TextSegment> chunks, ChunkingOptions options) {
        List<TextSegment> processedChunks = new ArrayList<>(chunks);

        // Merge undersized neighbouring chunks first; filtering by minChunkSize
        // beforehand would discard them before they could be merged
        if (options.mergeTinyChunks) {
            processedChunks = mergeTinyChunks(processedChunks, options.minChunkSize);
        }

        // Drop chunks that are still too short or too long
        processedChunks = processedChunks.stream()
                .filter(chunk -> chunk.text().length() >= options.minChunkSize)
                .filter(chunk -> chunk.text().length() <= options.maxChunkSize)
                .collect(Collectors.toList());

        // Optimize chunk boundaries
        if (options.optimizeBoundaries) {
            processedChunks = optimizeChunkBoundaries(processedChunks);
        }

        return processedChunks;
    }

    // Helper methods
    private boolean containsCodePatterns(String text) {
        return text.contains("public class") || text.contains("def ") ||
                text.contains("function") || text.contains("import ");
    }

    private boolean containsQAPatterns(String text) {
        return text.contains("问:") || text.contains("答:") ||
                text.contains("Q:") || text.contains("A:");
    }

    private boolean containsTutorialPatterns(String text) {
        return text.contains("第一章") || text.contains("步骤") ||
                text.contains("教程") || text.contains("学习");
    }

    private boolean containsTechnicalPatterns(String text) {
        return text.contains("API") || text.contains("配置") ||
                text.contains("参数") || text.contains("接口");
    }

    private String detectLanguage(String text) {
        int chineseCount = 0;
        int totalChars = 0;

        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                totalChars++;
                if (c >= 0x4e00 && c <= 0x9fff) {
                    chineseCount++;
                }
            }
        }

        return totalChars > 0 && (double) chineseCount / totalChars > 0.3 ? "zh" : "en";
    }

    private StructureFeatures analyzeStructure(String text) {
        int headingCount = text.split("(?m)^#{1,6}\\s").length - 1;
        int listCount = text.split("(?m)^[\\*\\-\\+]\\s").length - 1;
        int codeBlockCount = text.split("```").length / 2;

        return new StructureFeatures(headingCount, listCount, codeBlockCount);
    }

    private double calculateComplexity(String text) {
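        // Rough complexity score based on average sentence length and the share of special characters
        if (text.isEmpty()) {
            return 0.0; // avoids a 0/0 division (NaN) below
        }
        // Split on both ASCII and full-width (Chinese) sentence terminators
        String[] sentences = text.split("[.!?。!?]+");
        double avgSentenceLength = sentences.length > 0 ?
                (double) text.length() / sentences.length : 0;

        int specialChars = text.replaceAll("[a-zA-Z0-9\\s\\u4e00-\\u9fff]", "").length();
        double specialCharRatio = (double) specialChars / text.length();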

        return Math.min(1.0, (avgSentenceLength / 100.0) * 0.7 + specialCharRatio * 0.3);
    }

    private List<String> splitIntoSentences(String text) {
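        // Naive sentence splitting on ASCII and full-width (Chinese) terminators;
        // a dedicated NLP library would handle abbreviations and decimals better
        return Arrays.stream(text.split("[.!?。!?]+"))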
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    private boolean shouldStartNewSemanticChunk(List<String> currentSentences,
                                                String newSentence,
                                                ChunkingStrategy strategy) {
        if (currentSentences.isEmpty()) return false;

        String currentText = String.join(" ", currentSentences);
        if (currentText.length() + newSentence.length() > strategy.chunkSize) {
            return true;
        }

        // Semantic similarity check (simplified implementation)
        if (currentSentences.size() >= 3) {
            double similarity = calculateSemanticSimilarity(currentText, newSentence);
            return similarity < SEMANTIC_SIMILARITY_THRESHOLD;
        }

        return false;
    }

    private double calculateSemanticSimilarity(String text1, String text2) {
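        // Simplified similarity: Jaccard overlap of whitespace-separated words; a real implementation should compare embedding vectors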
        Set<String> words1 = new HashSet<>(Arrays.asList(text1.toLowerCase().split("\\s+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(text2.toLowerCase().split("\\s+")));

        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);

        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);

        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    private List<Section> detectSections(String text) {
        List<Section> sections = new ArrayList<>();
        String[] lines = text.split("\n");

        StringBuilder currentContent = new StringBuilder();
        String currentTitle = "Introduction";
        int currentLevel = 1;

        for (String line : lines) {
            if (line.matches("^#{1,6}\\s+.*")) {
                // Save the previous section
                if (currentContent.length() > 0) {
                    sections.add(new Section(currentTitle, currentLevel, currentContent.toString()));
                }

                // Start a new section; the heading level is the number of leading '#' characters
                currentLevel = line.indexOf(' ');
                currentTitle = line.substring(currentLevel + 1);
                currentContent = new StringBuilder();
            } else {
                currentContent.append(line).append("\n");
            }
        }

        // Add the final section
        if (currentContent.length() > 0) {
            sections.add(new Section(currentTitle, currentLevel, currentContent.toString()));
        }

        return sections;
    }

    private List<CodeBlock> extractCodeBlocks(String text) {
        List<CodeBlock> blocks = new ArrayList<>();

        // Java class detection (simple regex: [^}]* stops at the first closing brace, so nested blocks are truncated)
        Pattern classPattern = Pattern.compile("public\\s+class\\s+(\\w+)\\s*\\{([^}]*)\\}", Pattern.DOTALL);
        Matcher classMatcher = classPattern.matcher(text);
        while (classMatcher.find()) {
            blocks.add(new CodeBlock("class", classMatcher.group(1), classMatcher.group(0)));
        }

        // Java method detection (same limitation for nested braces)
        Pattern methodPattern = Pattern.compile("public\\s+\\w+\\s+(\\w+)\\s*\\([^)]*\\)\\s*\\{([^}]*)\\}", Pattern.DOTALL);
        Matcher methodMatcher = methodPattern.matcher(text);
        while (methodMatcher.find()) {
            blocks.add(new CodeBlock("method", methodMatcher.group(1), methodMatcher.group(0)));
        }

        return blocks;
    }

    private List<QAPair> extractQAPairs(String text) {
        List<QAPair> pairs = new ArrayList<>();
        String[] lines = text.split("\n");

        String currentQuestion = null;
        StringBuilder currentAnswer = new StringBuilder();

        for (String line : lines) {
            if (line.startsWith("问:") || line.startsWith("Q:")) {
                if (currentQuestion != null && currentAnswer.length() > 0) {
                    pairs.add(new QAPair(currentQuestion, currentAnswer.toString()));
                }
                currentQuestion = line;
                currentAnswer = new StringBuilder();
            } else if (line.startsWith("答:") || line.startsWith("A:")) {
                currentAnswer.append(line).append("\n");
            } else if (currentAnswer.length() > 0) {
                currentAnswer.append(line).append("\n");
            }
        }

        if (currentQuestion != null && currentAnswer.length() > 0) {
            pairs.add(new QAPair(currentQuestion, currentAnswer.toString()));
        }

        return pairs;
    }

    private List<TextSegment> mergeTinyChunks(List<TextSegment> chunks, int minSize) {
        List<TextSegment> merged = new ArrayList<>();

        for (int i = 0; i < chunks.size(); i++) {
            TextSegment current = chunks.get(i);

            if (current.text().length() < minSize && i < chunks.size() - 1) {
                TextSegment next = chunks.get(i + 1);
                if (current.text().length() + next.text().length() <= minSize * 2) {
                    // Merge the two chunks
                    String mergedContent = current.text() + "\n" + next.text();
                    Map<String, Object> mergedMetadata = new HashMap<>(current.metadata().toMap());
                    mergedMetadata.put("merged", "true");
                    Metadata metadata = new Metadata(mergedMetadata);
                    TextSegment mergedChunk = TextSegment.from(mergedContent, metadata);
                    merged.add(mergedChunk);

                    i++; // Skip the next chunk, which has just been merged
                    continue;
                }
            }

            merged.add(current);
        }

        return merged;
    }

    private List<TextSegment> optimizeChunkBoundaries(List<TextSegment> chunks) {
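        // Simple boundary optimization: make each chunk end on a sentence boundary.
        // Both ASCII and full-width (Chinese) sentence terminators are considered.
        // Note: any text after the last terminator is dropped from the chunk.
        return chunks.stream()
                .map(chunk -> {
                    String text = chunk.text();
                    int lastSentenceEnd = -1;
                    for (char terminator : new char[] {'.', '!', '?', '。', '!', '?'}) {
                        lastSentenceEnd = Math.max(lastSentenceEnd, text.lastIndexOf(terminator));
                    }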

                    if (lastSentenceEnd > text.length() * 0.8) {
                        text = text.substring(0, lastSentenceEnd + 1);
                        return TextSegment.from(text, chunk.metadata());
                    }

                    return chunk;
                })
                .collect(Collectors.toList());
    }

    /**
     * Demonstrates the intelligent chunking feature.
     */
    public void demonstrateIntelligentChunking() {
        System.out.println("=== 编程导航 Intelligent Document Chunking Demo ===");

        // Sample documents of different types
        String codeDoc = """
            public class BinarySearch {
                public static int search(int[] arr, int target) {
                    int left = 0, right = arr.length - 1;
                    while (left <= right) {
                        int mid = left + (right - left) / 2;
                        if (arr[mid] == target) return mid;
                        if (arr[mid] < target) left = mid + 1;
                        else right = mid - 1;
                    }
                    return -1;
                }
            }
            """;

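        String tutorialDoc = """
            # Java Tutorial

            ## Chapter 1: Basic Syntax
            Java is a strongly typed, object-oriented programming language. It is cross-platform, secure, and has a concise syntax.

            ### 1.1 Variable Declarations
            In Java, a variable must be declared before it is used. A declaration must specify the data type.

            ### 1.2 Control Structures
            Java provides several control structures, including if-else, for, and while.

            ## Chapter 2: Object-Oriented Programming
            Object orientation is at the core of Java and rests on three features: encapsulation, inheritance, and polymorphism.
            """;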

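        String faqDoc = """
            Q: What is the difference between Java and Python?
            A: Java is a compiled language that must first be compiled to bytecode; Python is an interpreted language that can be run directly. Java's syntax is stricter, while Python's is more concise.

            Q: How do I choose a suitable data structure?
            A: Choose based on the operations you need: a hash table for frequent lookups, an array for ordered data, and a linked list for dynamic insertion and deletion.
            """;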

        // Run the chunker on each document type
        testChunkingStrategy("Code document", codeDoc, DocumentType.CODE);
        testChunkingStrategy("Tutorial document", tutorialDoc, DocumentType.TUTORIAL);
        testChunkingStrategy("FAQ document", faqDoc, DocumentType.FAQ);
    }

    private void testChunkingStrategy(String docName, String content, DocumentType expectedType) {
        System.out.println("\n--- " + docName + " chunking test ---");

        Document document = Document.from(content);
        ChunkingOptions options = new ChunkingOptions.Builder()
                .targetUse(TargetUse.QA_SYSTEM)
                .minChunkSize(50)
                .maxChunkSize(800)
                .mergeTinyChunks(true)
                .optimizeBoundaries(true)
                .build();

        List<TextSegment> chunks = chunkDocument(document, options);

        // expectedType is the type the caller expects, not the type detected by the chunker
        System.out.println("Expected document type: " + expectedType);
        System.out.println("Chunk count: " + chunks.size());

        for (int i = 0; i < chunks.size(); i++) {
            TextSegment chunk = chunks.get(i);
            System.out.printf("Chunk %d: length=%d, method=%s%n",
                    i + 1, chunk.text().length(), chunk.metadata().getString("method"));
            System.out.println("Content preview: " +
                    chunk.text().substring(0, Math.min(100, chunk.text().length())) + "...");
        }
    }

    // Nested classes and enum definitions
    public enum DocumentType {
        CODE, TECHNICAL_DOC, TUTORIAL, FAQ, GENERAL
    }

    public enum ChunkingMethod {
        SLIDING_WINDOW, SEMANTIC, HIERARCHICAL, FUNCTION_BASED, QA_PAIR, SENTENCE_BOUNDARY
    }

    public enum TargetUse {
        QA_SYSTEM, SUMMARIZATION, SEARCH, ANALYSIS
    }

    public static class ChunkingStrategy {
        public final ChunkingMethod method;
        public final int chunkSize;
        public final int overlap;
        public final String description;

        public ChunkingStrategy(ChunkingMethod method, int chunkSize, int overlap, String description) {
            this.method = method;
            this.chunkSize = chunkSize;
            this.overlap = overlap;
            this.description = description;
        }
    }

    public static class ChunkingOptions {
        public final ChunkingStrategy preferredStrategy;
        public final TargetUse targetUse;
        public final int minChunkSize;
        public final int maxChunkSize;
        public final boolean mergeTinyChunks;
        public final boolean optimizeBoundaries;

        private ChunkingOptions(Builder builder) {
            this.preferredStrategy = builder.preferredStrategy;
            this.targetUse = builder.targetUse;
            this.minChunkSize = builder.minChunkSize;
            this.maxChunkSize = builder.maxChunkSize;
            this.mergeTinyChunks = builder.mergeTinyChunks;
            this.optimizeBoundaries = builder.optimizeBoundaries;
        }

        public static class Builder {
            private ChunkingStrategy preferredStrategy;
            private TargetUse targetUse = TargetUse.SEARCH;
            private int minChunkSize = 50;
            private int maxChunkSize = 1000;
            private boolean mergeTinyChunks = false;
            private boolean optimizeBoundaries = false;

            public Builder preferredStrategy(ChunkingStrategy strategy) {
                this.preferredStrategy = strategy;
                return this;
            }

            public Builder targetUse(TargetUse use) {
                this.targetUse = use;
                return this;
            }

            public Builder minChunkSize(int size) {
                this.minChunkSize = size;
                return this;
            }

            public Builder maxChunkSize(int size) {
                this.maxChunkSize = size;
                return this;
            }

            public Builder mergeTinyChunks(boolean merge) {
                this.mergeTinyChunks = merge;
                return this;
            }

            public Builder optimizeBoundaries(boolean optimize) {
                this.optimizeBoundaries = optimize;
                return this;
            }

            public ChunkingOptions build() {
                return new ChunkingOptions(this);
            }
        }
    }

    private static class DocumentAnalysis {
        public final DocumentType documentType;
        public final String language;
        public final int totalLength;
        public final int lineCount;
        public final int paragraphCount;
        public final StructureFeatures structure;
        public final double complexity;

        public DocumentAnalysis(DocumentType documentType, String language, int totalLength,
                                int lineCount, int paragraphCount, StructureFeatures structure,
                                double complexity) {
            this.documentType = documentType;
            this.language = language;
            this.totalLength = totalLength;
            this.lineCount = lineCount;
            this.paragraphCount = paragraphCount;
            this.structure = structure;
            this.complexity = complexity;
        }
    }

    private static class StructureFeatures {
        public final int headingCount;
        public final int listCount;
        public final int codeBlockCount;

        public StructureFeatures(int headingCount, int listCount, int codeBlockCount) {
            this.headingCount = headingCount;
            this.listCount = listCount;
            this.codeBlockCount = codeBlockCount;
        }
    }

    private static class ChunkingRule {
        // Custom chunking rules (extensible)
    }

    private static class Section {
        public final String title;
        public final int level;
        public final String content;

        public Section(String title, int level, String content) {
            this.title = title;
            this.level = level;
            this.content = content;
        }
    }

    private static class CodeBlock {
        public final String type;
        public final String name;
        public final String content;

        public CodeBlock(String type, String name, String content) {
            this.type = type;
            this.name = name;
            this.content = content;
        }
    }

    private static class QAPair {
        public final String question;
        public final String answer;

        public QAPair(String question, String answer) {
            this.question = question;
            this.answer = answer;
        }
    }
}
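
The regular expressions in `extractCodeBlocks` capture a body with `[^}]*`, which ends at the first closing brace, so a class that contains methods or loops is cut short. One possible alternative is to find the block header with a regex and then count braces to locate the matching close. The sketch below illustrates that idea under stated assumptions: `BalancedBlockExtractor`, `extractClasses`, and `findMatchingBrace` are hypothetical names that do not appear in the chapter's code, and braces inside string literals or comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the chapter's service class. */
public class BalancedBlockExtractor {

    /** Returns the full text of every "public class Name { ... }" block, nested braces included. */
    public static List<String> extractClasses(String source) {
        List<String> blocks = new ArrayList<>();
        Pattern header = Pattern.compile("public\\s+class\\s+\\w+\\s*\\{");
        Matcher matcher = header.matcher(source);
        int searchFrom = 0;
        while (matcher.find(searchFrom)) {
            int open = matcher.end() - 1; // index of the opening brace matched by the header
            int close = findMatchingBrace(source, open);
            if (close < 0) {
                break; // unbalanced input: stop rather than guess where the block ends
            }
            blocks.add(source.substring(matcher.start(), close + 1));
            searchFrom = close + 1; // continue after this block
        }
        return blocks;
    }

    /** Scans forward from an opening brace and returns the index of its matching close, or -1. */
    static int findMatchingBrace(String source, int open) {
        int depth = 0;
        for (int i = open; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String source = "public class A { void f() { if (true) { } } } public class B { }";
        extractClasses(source).forEach(System.out::println);
    }
}
```

Brace counting keeps the whole class in one chunk, which is usually what a function-based chunking strategy wants; a real parser would be the safer choice for source that contains braces in strings or comments.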

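Through these three exercises, we explored the core concepts and practical applications of vector databases and embeddings. The first exercise implemented and compared several similarity measures; the second built a complete performance monitoring system that tracks the runtime state of a vector database in real time; the third implemented an intelligent document chunking system that selects a chunking strategy based on the characteristics of each document.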
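
As a compact recap of the similarity comparison from the first exercise, the following minimal sketch places two measures side by side: cosine similarity over embedding vectors and Jaccard similarity over word sets. It is an illustration, not the chapter's implementation: the class name `SimilaritySketch` and its methods are hypothetical, the vectors in `main` are made-up values rather than real embeddings, and the whitespace tokenizer is an assumption that suits English text but not unsegmented Chinese.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical recap class for illustration; not part of the chapter's code. */
public class SimilaritySketch {

    /** Cosine similarity of two equal-length vectors; returns 0.0 if either vector has zero norm. */
    public static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Jaccard similarity: size of the shared word set divided by size of the combined word set. */
    public static double jaccard(String s1, String s2) {
        Set<String> words1 = tokens(s1);
        Set<String> words2 = tokens(s2);

        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);

        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);

        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    /** Lower-cases the text and splits it on whitespace (an assumption; see the note above). */
    private static Set<String> tokens(String text) {
        String trimmed = text.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return new HashSet<>();
        }
        return new HashSet<>(Arrays.asList(trimmed.split("\\s+")));
    }

    public static void main(String[] args) {
        float[] v1 = {0.2f, 0.7f, 0.1f}; // made-up vectors, not real embeddings
        float[] v2 = {0.3f, 0.6f, 0.2f};
        System.out.println("cosine  = " + cosine(v1, v2));
        System.out.println("jaccard = " + jaccard("vector database", "vector search"));
    }
}
```

Cosine similarity compares direction in the embedding space and so can relate texts that share no words, whereas Jaccard similarity only counts literal token overlap; that difference is why the two can rank the same pair of texts differently.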

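These exercises not only clarify the technical principles behind vector databases but, more importantly, show how to apply them to concrete problems in real projects. Whether you are building an intelligent Q&A system, implementing semantic search, or streamlining a document-processing pipeline, these skills will give your AI application development solid support.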
