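13 Vector Databases and Embeddings
Vector databases and embedding models are core infrastructure for modern AI applications: they let software compare texts by meaning rather than by exact wording, which is what makes semantic retrieval possible. This chapter shows how to build a semantic search system and lays the groundwork for RAG (retrieval-augmented generation) applications.
13.1 Vectorization: Principles and Implementation
Vectorization converts unstructured data such as text or images into numeric vectors. In natural language processing, an embedding model maps text to a high-dimensional vector that captures its semantics: texts with similar meaning lie close together in the vector space, and unrelated texts lie farther apart.
The core idea of text embedding is to map words, sentences, or paragraphs into a continuous vector space. The mapping preserves semantic information and lets us measure the similarity between texts with ordinary arithmetic. The technical documentation search in 编程导航 relies on this principle for its question-answering feature.
Let's start with a simple text vectorization example. First, add the required dependencies:
<!-- Embedding (vectorization) support -->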
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings-bge-small-zh</artifactId>
<version>1.2.0-beta8</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-embeddings-bge-small-en-v15-q</artifactId>
<version>1.2.0-beta8</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-community-dashscope-spring-boot-starter</artifactId>
<version>1.1.0-beta7</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-easy-rag</artifactId>
<version>1.0.0-beta3</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-core</artifactId>
<version>1.2.0</version>
</dependency>
Now for the code:
package com.yupi.vectordb.service;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import org.springframework.stereotype.Service;
import java.util.Arrays;
import java.util.List;
/**
* 面试鸭文本向量化服务
* 演示向量化原理和相似度计算
*/
@Service
public class TextEmbeddingService {
private final EmbeddingModel embeddingModel;
public TextEmbeddingService() {
// 使用轻量级的嵌入模型
this.embeddingModel = new BgeSmallZhEmbeddingModel();
}
/**
* 将文本转换为向量
*/
public float[] embedText(String text) {
Embedding embedding = embeddingModel.embed(text).content();
return embedding.vector();
}
/**
* 计算两个文本的余弦相似度
*/
public double calculateSimilarity(String text1, String text2) {
float[] vector1 = embedText(text1);
float[] vector2 = embedText(text2);
return cosineSimilarity(vector1, vector2);
}
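/**
* Cosine similarity of two vectors.
* Returns 0 when either vector has zero magnitude, because the direction is undefined in that case.
*/
private double cosineSimilarity(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("Vectors must have the same dimension");
}
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vectorA.length; i++) {
dotProduct += (double) vectorA[i] * vectorB[i];
normA += (double) vectorA[i] * vectorA[i];
normB += (double) vectorB[i] * vectorB[i];
}
if (normA == 0.0 || normB == 0.0) {
return 0.0;
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}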
/**
* 在文本列表中找到最相似的文本
*/
public String findMostSimilar(String query, List<String> candidates) {
float[] queryVector = embedText(query);
double maxSimilarity = -1.0;
String mostSimilar = null;
for (String candidate : candidates) {
float[] candidateVector = embedText(candidate);
double similarity = cosineSimilarity(queryVector, candidateVector);
if (similarity > maxSimilarity) {
maxSimilarity = similarity;
mostSimilar = candidate;
}
}
System.out.println("最高相似度:" + String.format("%.4f", maxSimilarity));
return mostSimilar;
}
/**
* 演示向量化效果
*/
public void demonstrateEmbedding() {
List<String> documents = Arrays.asList(
"Java 是一种面向对象的编程语言",
"Python 是一种简洁易学的编程语言",
"老鱼简历帮助用户制作专业简历",
"算法导航提供可视化算法学习",
"面试鸭收录了大量面试题目"
);
String query = "编程语言学习";
System.out.println("查询:" + query);
System.out.println("候选文档:");
for (int i = 0; i < documents.size(); i++) {
double similarity = calculateSimilarity(query, documents.get(i));
System.out.printf("%d. %s (相似度: %.4f)%n",
i + 1, documents.get(i), similarity);
}
String result = findMostSimilar(query, documents);
System.out.println("最相似的文档:" + result);
}
}
Sample output (the exact similarity values depend on the model):
查询:编程语言学习
候选文档:
1. Java 是一种面向对象的编程语言 (相似度: 0.7892)
2. Python 是一种简洁易学的编程语言 (相似度: 0.8156)
3. 老鱼简历帮助用户制作专业简历 (相似度: 0.2341)
4. 算法导航提供可视化算法学习 (相似度: 0.4567)
5. 面试鸭收录了大量面试题目 (相似度: 0.3289)
最高相似度:0.8156
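最相似的文档:Python 是一种简洁易学的编程语言
Cosine similarity is only one of several ways to measure distance in a vector space; Euclidean distance and Manhattan distance are also common. Cosine similarity depends on the direction of the vectors rather than their magnitude, which makes it a good fit for comparing texts. It is 1 when two vectors point in the same direction, -1 when they point in opposite directions, and 0 when they are perpendicular.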
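To make the three measures concrete, here is a minimal plain-JDK sketch. The class name `DistanceMetrics` and its method names are illustrative and not part of LangChain4j; returning 0 for a zero vector is a convention chosen here, since the direction of a zero vector is undefined.

```java
/** Three common vector measures written with plain JDK code (illustrative names). */
final class DistanceMetrics {

    private DistanceMetrics() {}

    /** Cosine similarity in [-1, 1]; returns 0 when either vector has zero magnitude. */
    static double cosine(float[] a, float[] b) {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // assumption: treat an undefined direction as "no similarity"
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Euclidean (L2) distance: straight-line distance between the two points. */
    static double euclidean(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Manhattan (L1) distance: sum of absolute coordinate differences. */
    static double manhattan(float[] a, float[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((double) a[i] - b[i]);
        }
        return sum;
    }

    private static void requireSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f};
        float[] b = {2f, 4f}; // same direction as a, twice the magnitude
        System.out.println("cosine    = " + cosine(a, b));
        System.out.println("euclidean = " + euclidean(a, b));
        System.out.println("manhattan = " + manhattan(a, b));
    }
}
```

For vectors that point in the same direction but differ in length, such as (1, 2) and (2, 4), cosine similarity is 1 up to floating-point rounding while both distances are non-zero. That is why cosine similarity is the usual choice when only the direction matters.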
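Text chunking is another important step in vectorization. Long documents need to be split into shorter segments so that each vector represents one specific piece of meaning. Common strategies are splitting by character count, by sentence, or by paragraph; which one works best depends on the application.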
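The first two strategies can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the splitter LangChain4j ships: the class `TextChunker`, the `overlap` parameter, and the naive sentence regex are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

/** Two simple chunking strategies (illustrative, plain JDK). */
final class TextChunker {

    private TextChunker() {}

    /**
     * Fixed-size windows of at most maxChars characters.
     * Consecutive windows share `overlap` characters so that text cut at a
     * boundary still appears whole in one of the two neighbouring chunks.
     */
    static List<String> byCharacters(String text, int maxChars, int overlap) {
        if (maxChars <= 0 || overlap < 0 || overlap >= maxChars) {
            throw new IllegalArgumentException("require maxChars > 0 and 0 <= overlap < maxChars");
        }
        List<String> chunks = new ArrayList<>();
        int step = maxChars - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + maxChars, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window already reached the end of the text
            }
        }
        return chunks;
    }

    /**
     * Splits after sentence terminators and packs whole sentences into chunks
     * of at most maxChars characters. A single sentence longer than maxChars
     * is kept as one oversized chunk rather than being cut.
     */
    static List<String> bySentences(String text, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Lookbehind keeps each terminator attached to its sentence. The escapes are
        // the CJK full stop and the full-width exclamation and question marks.
        for (String sentence : text.split("(?<=[.!?\u3002\uFF01\uFF1F])")) {
            if (current.length() > 0 && current.length() + sentence.length() > maxChars) {
                addIfNotBlank(chunks, current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        addIfNotBlank(chunks, current.toString());
        return chunks;
    }

    private static void addIfNotBlank(List<String> chunks, String candidate) {
        String trimmed = candidate.trim();
        if (!trimmed.isEmpty()) {
            chunks.add(trimmed);
        }
    }
}
```

The regex treats every period as a sentence end, so it also splits inside decimals and abbreviations. A production splitter would need a more careful boundary rule.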
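13.2 Using Chroma
Chroma is an open-source vector database designed for AI applications, with efficient vector storage and retrieval. It is lightweight and easy to use, which makes it a good fit for prototypes and small to medium-sized applications. In the 代码小抄 project, we use Chroma to store vector representations of code snippets and power semantic code search.
Chroma's main strengths are a simple API and good performance. It works with a range of embedding models and supports similarity search, filtered queries, and aggregation operations. Each vector can also carry metadata, so extra information can be attached to every stored item.
Let's see how to integrate Chroma into a Spring Boot application. First, add the Chroma dependency:
<dependency>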
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-chroma</artifactId>
<version>1.2.0-beta8</version>
</dependency>
Here is the configuration class:
package com.yupi.vectordb.config;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.chroma.ChromaEmbeddingStore;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
* 剪切助手 Chroma 向量数据库配置
*/
@Configuration
public class ChromaConfig {
@Bean
public EmbeddingModel embeddingModel() {
return new BgeSmallZhEmbeddingModel();
}
@Bean
public EmbeddingStore<TextSegment> chromaEmbeddingStore() {
return ChromaEmbeddingStore.builder()
.baseUrl("http://localhost:8000") // Chroma 服务地址
.collectionName("code_snippets") // 集合名称
.build();
}
}
Next, create a vector database service that shows how to add documents and run semantic searches:
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
/**
* 编程导航 Chroma 向量数据库服务
* 提供文档的向量化存储和语义检索功能
*/
@Service
public class ChromaVectorService {
@Autowired
private EmbeddingModel embeddingModel;
@Autowired
private EmbeddingStore<TextSegment> embeddingStore;
/**
* 添加文档到向量数据库
*/
public String addDocument(String text, Map<String, String> metadataMap) {
// 创建文本片段
Metadata metadata = new Metadata(metadataMap);
TextSegment segment = TextSegment.from(text, metadata);
// 生成向量
Embedding embedding = embeddingModel.embed(segment).content();
// 生成唯一ID
String documentId = UUID.randomUUID().toString();
segment.metadata().put("documentId", documentId);
// 存储到向量数据库
embeddingStore.add(embedding, segment);
return documentId;
}
/**
* 批量添加文档
*/
public List<String> addDocuments(List<String> texts, String category) {
return texts.stream()
.map(text -> {
Map<String, String> metadata = new HashMap<>();
metadata.put("category", category);
metadata.put("timestamp", String.valueOf(System.currentTimeMillis()));
return addDocument(text, metadata);
})
.toList();
}
/**
* 语义搜索
*/
public List<TextSegment> semanticSearch(String query, int maxResults) {
// 将查询向量化
Embedding queryEmbedding = embeddingModel.embed(query).content();
EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(maxResults * 2)
.build();
EmbeddingSearchResult<TextSegment> searchResult = embeddingStore.search(searchRequest);
List<EmbeddingMatch<TextSegment>> matches = searchResult.matches();
// 过滤并返回结果
return matches.stream()
.filter(match -> match.score() > 0.5) // 过滤低相关度结果
.limit(maxResults)
.map(EmbeddingMatch::embedded)
.toList();
}
/**
* 带过滤条件的语义搜索
*/
public List<TextSegment> searchWithFilter(String query, String category, int maxResults) {
List<TextSegment> allResults = semanticSearch(query, maxResults * 3);
return allResults.stream()
.filter(segment -> category.equals(segment.metadata().getString("category")))
.limit(maxResults)
.toList();
}
/**
* 获取相似文档
*/
public List<TextSegment> findSimilarDocuments(String documentText, int maxResults) {
return semanticSearch(documentText, maxResults);
}
/**
* 演示 Chroma 向量数据库的使用
*/
public void demonstrateChromaUsage() {
// 准备示例数据
List<String> javaCode = List.of(
"public class HelloWorld { public static void main(String[] args) { System.out.println(\"Hello World\"); } }",
"public int fibonacci(int n) { if (n <= 1) return n; return fibonacci(n-1) + fibonacci(n-2); }",
"public void bubbleSort(int[] arr) { int n = arr.length; for (int i = 0; i < n-1; i++) { for (int j = 0; j < n-i-1; j++) { if (arr[j] > arr[j+1]) { int temp = arr[j]; arr[j] = arr[j+1]; arr[j+1] = temp; } } } }"
);
List<String> pythonCode = List.of(
"def hello_world(): print('Hello World')",
"def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
"def bubble_sort(arr): n = len(arr); for i in range(n): for j in range(0, n-i-1): if arr[j] > arr[j+1]: arr[j], arr[j+1] = arr[j+1], arr[j]"
);
// 添加文档到数据库
System.out.println("添加 Java 代码片段...");
List<String> javaIds = addDocuments(javaCode, "java");
System.out.println("Java 代码片段已添加,ID: " + javaIds);
System.out.println("添加 Python 代码片段...");
List<String> pythonIds = addDocuments(pythonCode, "python");
System.out.println("Python 代码片段已添加,ID: " + pythonIds);
// 进行语义搜索
System.out.println("\n搜索:排序算法");
List<TextSegment> sortResults = semanticSearch("排序算法", 3);
for (int i = 0; i < sortResults.size(); i++) {
TextSegment segment = sortResults.get(i);
System.out.printf("%d. [%s] %s%n",
i + 1,
segment.metadata().getString("category"),
segment.text().substring(0, Math.min(50, segment.text().length())) + "..."
);
}
// 按类别过滤搜索
System.out.println("\n在 Python 代码中搜索:Hello World");
List<TextSegment> pythonResults = searchWithFilter("Hello World", "python", 2);
for (TextSegment segment : pythonResults) {
System.out.println("找到 Python 代码:" + segment.text());
}
}
}
Sample output (the IDs are random UUIDs and differ on every run):
添加 Java 代码片段...
Java 代码片段已添加,ID: [<UUID>, <UUID>, <UUID>]
添加 Python 代码片段...
Python 代码片段已添加,ID: [<UUID>, <UUID>, <UUID>]
搜索:排序算法
1. [java] public void bubbleSort(int[] arr) { int n = arr.leng...
2. [python] def bubble_sort(arr): n = len(arr); for i in range...
3. [java] public int fibonacci(int n) { if (n <= 1) return n...
在 Python 代码中搜索:Hello World
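找到 Python 代码:def hello_world(): print('Hello World')
Chroma also supports more advanced features such as index tuning, persistent storage, and distributed deployment. In production, persistent storage is normally enabled to keep data safe, and the index strategy is chosen according to data volume.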
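One detail of `searchWithFilter` above is worth illustrating: it retrieves the globally best matches first and filters by category afterwards, so it can return fewer results than requested, or none, when another category dominates the top of the ranking. The sketch below shows the difference on an in-memory list. The record `Scored` and both method names are hypothetical, the scores are assumed to be precomputed, and the post-filter variant is a simplified stand-in for the service method. LangChain4j's `EmbeddingSearchRequest` builder also accepts a metadata filter, but whether a given store applies it depends on the integration, so check the documentation for your version.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Compares filter-then-rank with rank-then-filter on an in-memory list (illustrative). */
final class FilteredSearchSketch {

    /** Hypothetical record: a text, its metadata, and a precomputed similarity score. */
    record Scored(String text, Map<String, String> metadata, double score) {}

    private static final Comparator<Scored> BY_SCORE_DESC =
            Comparator.comparingDouble(Scored::score).reversed();

    private FilteredSearchSketch() {}

    /** Restricts the candidates to one category first, then keeps the topK best of those. */
    static List<Scored> filterThenTopK(List<Scored> candidates, String category, int topK) {
        return candidates.stream()
                .filter(c -> category.equals(c.metadata().get("category")))
                .sorted(BY_SCORE_DESC)
                .limit(topK)
                .toList();
    }

    /** Takes the global top (topK * 3) first and filters afterwards, like the post-filter approach. */
    static List<Scored> topKThenFilter(List<Scored> candidates, String category, int topK) {
        return candidates.stream()
                .sorted(BY_SCORE_DESC)
                .limit((long) topK * 3)
                .filter(c -> category.equals(c.metadata().get("category")))
                .limit(topK)
                .toList();
    }

    public static void main(String[] args) {
        List<Scored> data = List.of(
                new Scored("j1", Map.of("category", "java"), 0.9),
                new Scored("j2", Map.of("category", "java"), 0.8),
                new Scored("j3", Map.of("category", "java"), 0.7),
                new Scored("p1", Map.of("category", "python"), 0.5));
        // The only Python entry ranks fourth, outside the global top 3.
        System.out.println(topKThenFilter(data, "python", 1));
        System.out.println(filterThenTopK(data, "python", 1));
    }
}
```

Filtering before ranking always returns up to `topK` items of the requested category when they exist, which is usually the behavior callers expect from a filtered search.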
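13.3 Pinecone Integration
Pinecone is a managed cloud vector database that provides high-performance vector storage and retrieval. It is designed for large-scale AI applications and supports retrieval over billions of vectors. In 老鱼简历's resume matching system, we use Pinecone to store job requirements and resume content and match one against the other.
Pinecone's main advantages are a fully managed cloud service, automatic scaling, high availability, and enterprise-grade security. It offers a RESTful API and SDKs for several languages, so vector search is easy to integrate. It also supports metadata filtering, approximate nearest neighbor search, and real-time index updates.
To use Pinecone, first add the dependency:
<dependency>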
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-pinecone</artifactId>
<version>1.2.0-beta8</version>
</dependency>
Here is the configuration class:
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.pinecone.PineconeEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
/**
* 算法导航 Pinecone 向量数据库配置
*/
@Configuration
public class PineconeConfig {
@Value("${pinecone.api-key}")
private String pineconeApiKey;
@Value("${pinecone.environment:us-west1-gcp}")
private String pineconeEnvironment;
@Value("${pinecone.index-name:algorithm-docs}")
private String indexName;
@Bean("bgeSmallZhEmbeddingModel")
public EmbeddingModel embeddingModel() {
return new BgeSmallZhEmbeddingModel();
}
@Bean
public EmbeddingStore<TextSegment> pineconeEmbeddingStore() {
return PineconeEmbeddingStore.builder()
.apiKey(pineconeApiKey)
.environment(pineconeEnvironment)
.index(indexName)
.nameSpace("main") // 命名空间,用于逻辑分区
.build();
}
}
Now create a Pinecone service class that demonstrates more advanced vector operations:
package com.yupi.mcp.mcpclient.controller;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingSearchRequest;
import dev.langchain4j.store.embedding.EmbeddingSearchResult;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
/**
* 面试鸭 Pinecone 云端向量数据库服务
* 提供企业级向量存储和检索能力
*/
@Service
public class PineconeVectorService {
@Autowired
@Qualifier("bgeSmallZhEmbeddingModel") // selects the EmbeddingModel bean declared in PineconeConfig
private EmbeddingModel embeddingModel;
@Autowired
private EmbeddingStore<TextSegment> pineconeStore;
/**
* 批量上传文档到 Pinecone
*/
public CompletableFuture<List<String>> batchUpload(List<String> documents, String category) {
return CompletableFuture.supplyAsync(() -> {
List<String> documentIds = new ArrayList<>();
for (int i = 0; i < documents.size(); i++) {
String document = documents.get(i);
String documentId = category + "_" + System.currentTimeMillis() + "_" + i;
// 创建元数据
Map<String, String> metadataMap = new HashMap<>();
metadataMap.put("category", category);
metadataMap.put("documentId", documentId);
metadataMap.put("index", String.valueOf(i));
metadataMap.put("length", String.valueOf(document.length()));
// 创建文本片段
Metadata metadata = new Metadata(metadataMap);
TextSegment segment = TextSegment.from(document, metadata);
// 生成向量并存储
Embedding embedding = embeddingModel.embed(segment).content();
pineconeStore.add(embedding, segment);
documentIds.add(documentId);
}
return documentIds;
});
}
/**
* 智能问答检索
*/
public List<TextSegment> intelligentRetrieve(String query, String category, int topK) {
// 向量化查询
Embedding queryEmbedding = embeddingModel.embed(query).content();
// 执行相似度搜索
EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(topK * 2)
.build();
EmbeddingSearchResult<TextSegment> searchResult = pineconeStore.search(searchRequest);
List<EmbeddingMatch<TextSegment>> allMatches = searchResult.matches();
// 根据类别和相似度进行过滤和排序
return allMatches.stream()
.filter(match -> match.score() > 0.7) // 高相似度阈值
.filter(match -> {
TextSegment segment = match.embedded();
return category == null || category.equals(segment.metadata().getString("category"));
})
.sorted((a, b) -> Double.compare(b.score(), a.score()))
.limit(topK)
.map(EmbeddingMatch::embedded)
.collect(Collectors.toList());
}
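/**
* Keyword-filtered semantic search: runs the semantic search first,
* then keeps only the results that contain at least one keyword.
*/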
public List<TextSegment> hybridSearch(String query, List<String> keywords, int maxResults) {
// 语义搜索结果
List<TextSegment> semanticResults = intelligentRetrieve(query, null, maxResults);
// 关键词过滤
if (keywords != null && !keywords.isEmpty()) {
semanticResults = semanticResults.stream()
.filter(segment -> containsAnyKeyword(segment.text(), keywords))
.collect(Collectors.toList());
}
return semanticResults;
}
/**
* 检查文本是否包含任何关键词
*/
private boolean containsAnyKeyword(String text, List<String> keywords) {
String lowerText = text.toLowerCase();
return keywords.stream()
.anyMatch(keyword -> lowerText.contains(keyword.toLowerCase()));
}
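/**
* Groups search results into relevance tiers by similarity score
* (threshold-based bucketing, not a clustering algorithm).
*/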
public Map<String, List<TextSegment>> clusterSimilarDocuments(String query, double threshold) {
Embedding queryEmbedding = embeddingModel.embed(query).content();
// 获取大量相关文档
// 执行相似度搜索
EmbeddingSearchRequest searchRequest = EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(50)
.build();
EmbeddingSearchResult<TextSegment> searchResult = pineconeStore.search(searchRequest);
List<EmbeddingMatch<TextSegment>> matches = searchResult.matches();
// LinkedHashMap keeps insertion order, so the tiers print from most to least relevant
Map<String, List<TextSegment>> clusters = new LinkedHashMap<>();
clusters.put("高相关", new ArrayList<>());
clusters.put("中等相关", new ArrayList<>());
clusters.put("低相关", new ArrayList<>());
for (EmbeddingMatch<TextSegment> match : matches) {
double score = match.score();
TextSegment segment = match.embedded();
if (score >= threshold + 0.1) {
clusters.get("高相关").add(segment);
} else if (score >= threshold) {
clusters.get("中等相关").add(segment);
} else if (score >= threshold - 0.1) {
clusters.get("低相关").add(segment);
}
}
return clusters;
}
/**
* 演示 Pinecone 高级功能
*/
public void demonstratePineconeFeatures() {
// 准备不同类型的文档
List<String> algorithmDocs = Arrays.asList(
"快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)",
"二分查找适用于有序数组,时间复杂度为O(logn)",
"深度优先搜索是图遍历的基本算法,使用栈或递归实现"
);
List<String> dataStructureDocs = Arrays.asList(
"链表是一种线性数据结构,支持动态内存分配",
"二叉树是每个节点最多有两个子节点的树结构",
"哈希表通过哈希函数实现O(1)的平均查找时间"
);
try {
// 异步批量上传
System.out.println("上传算法文档到 Pinecone...");
CompletableFuture<List<String>> algorithmFuture = batchUpload(algorithmDocs, "algorithm");
System.out.println("上传数据结构文档到 Pinecone...");
CompletableFuture<List<String>> dataStructureFuture = batchUpload(dataStructureDocs, "data_structure");
// 等待上传完成
List<String> algorithmIds = algorithmFuture.get();
List<String> dataStructureIds = dataStructureFuture.get();
System.out.println("算法文档 ID: " + algorithmIds);
System.out.println("数据结构文档 ID: " + dataStructureIds);
// 智能检索演示
System.out.println("\n智能检索:查找排序相关内容");
List<TextSegment> sortResults = intelligentRetrieve("排序算法", null, 3);
for (int i = 0; i < sortResults.size(); i++) {
TextSegment segment = sortResults.get(i);
System.out.printf("%d. [%s] %s%n",
i + 1,
segment.metadata().getString("category"),
segment.text()
);
}
// 混合搜索演示
System.out.println("\n混合搜索:查找包含'时间复杂度'的内容");
List<TextSegment> hybridResults = hybridSearch("算法效率",
Arrays.asList("时间复杂度", "O("), 2);
for (TextSegment segment : hybridResults) {
System.out.println("找到:" + segment.text());
}
// 聚类分析演示
System.out.println("\n相似度聚类分析:");
Map<String, List<TextSegment>> clusters = clusterSimilarDocuments("数据结构", 0.7);
clusters.forEach((level, segments) -> {
System.out.println(level + "(" + segments.size() + "个文档):");
segments.forEach(segment ->
System.out.println(" - " + segment.text().substring(0, Math.min(30, segment.text().length())) + "...")
);
});
} catch (Exception e) {
System.err.println("Pinecone 操作失败: " + e.getMessage());
}
}
}
Sample output:
上传算法文档到 Pinecone...
上传数据结构文档到 Pinecone...
算法文档 ID: [algorithm_1672531200000_0, algorithm_1672531200001_1, algorithm_1672531200002_2]
数据结构文档 ID: [data_structure_1672531200003_0, data_structure_1672531200004_1, data_structure_1672531200005_2]
智能检索:查找排序相关内容
1. [algorithm] 快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)
2. [algorithm] 二分查找适用于有序数组,时间复杂度为O(logn)
3. [data_structure] 哈希表通过哈希函数实现O(1)的平均查找时间
混合搜索:查找包含'时间复杂度'的内容
找到:快速排序是一种高效的排序算法,平均时间复杂度为O(nlogn)
找到:二分查找适用于有序数组,时间复杂度为O(logn)
相似度聚类分析:
高相关(2个文档):
- 链表是一种线性数据结构,支持动态内存分配...
- 二叉树是每个节点最多有两个子节点的树结构...
中等相关(1个文档):
- 哈希表通过哈希函数实现O(1)的平均查找时间...
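低相关(0个文档):
Pinecone also offers enterprise features such as backup and restore, access control, and monitoring and alerting. When choosing a vector database, weigh data volume, performance requirements, budget, and your team's technical capabilities.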
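The `hybridSearch` method above uses keywords only as a filter on semantic results. A common alternative, shown here as a swapped-in technique rather than anything the service implements, is to run a keyword search and a semantic search separately and merge the two rankings with Reciprocal Rank Fusion (RRF). In this minimal sketch the class name is illustrative, the inputs are assumed to be lists of document IDs already ordered best-first, and 60 is the constant commonly used for `k`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Reciprocal Rank Fusion over several best-first rankings of document IDs (illustrative). */
final class ReciprocalRankFusion {

    private ReciprocalRankFusion() {}

    /**
     * Each ranking contributes 1 / (k + rank) to a document's fused score,
     * with rank starting at 1. Documents found by several rankings accumulate
     * several contributions and therefore move up.
     */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                scores.merge(ranking.get(i), 1.0 / (k + i + 1), Double::sum);
            }
        }
        List<String> fused = new ArrayList<>(scores.keySet());
        // Highest fused score first; ties are broken by ID so the order is deterministic.
        fused.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return fused;
    }

    public static void main(String[] args) {
        List<String> semantic = List.of("A", "B", "C"); // hypothetical semantic ranking
        List<String> keyword = List.of("B", "D");       // hypothetical keyword ranking
        System.out.println(fuse(List.of(semantic, keyword), 60));
    }
}
```

RRF works on ranks alone, so the keyword and semantic scores never need to be brought onto the same scale.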
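13.4 Using Weaviate
Weaviate is an open-source vector database that combines vector search with traditional database features, including graph database capabilities and complex queries. What sets it apart is that one platform covers semantic search, recommendation, and knowledge graphs. In 剪切助手's cross-device content sync system, we use Weaviate for content classification and recommendation.
Weaviate supports a range of vectorization models, including those from OpenAI, Cohere, and Hugging Face. It also provides a GraphQL query interface for complex semantic queries. Its modular architecture allows custom extensions, so functionality can be added as modules to suit specific needs.
Let's see how to configure and use Weaviate. First, add the dependency:
<dependency>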
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-weaviate</artifactId>
<version>1.2.0-beta8</version>
</dependency>
Here is the configuration class:
package com.yupi.vectordb.config;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingStore;
import dev.langchain4j.store.embedding.weaviate.WeaviateEmbeddingStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.util.Arrays;
/**
* 代码小抄 Weaviate 向量数据库配置
*/
@Configuration
public class WeaviateConfig {
@Value("${weaviate.host:http://localhost}")
private String weaviateHost;
@Value("${weaviate.port:8080}")
private int weaviatePort;
@Value("${weaviate.api-key:}")
private String apiKey;
@Bean
public EmbeddingModel embeddingModel() {
return new BgeSmallZhEmbeddingModel();
}
@Bean("weaviateEmbeddingStore")
public EmbeddingStore<TextSegment> weaviateEmbeddingStore() {
WeaviateEmbeddingStore.WeaviateEmbeddingStoreBuilder builder = WeaviateEmbeddingStore.builder()
.host(weaviateHost)
.port(weaviatePort)
.objectClass("CodeSnippet") // Weaviate 类名
.textFieldName("content") // 文本内容字段
.metadataKeys(Arrays.asList("language", "category", "author", "timestamp"));
if (!apiKey.isEmpty()) {
builder.apiKey(apiKey);
}
return builder.build();
}
}
Next, create a Weaviate service class that demonstrates semantic code search and a simple recommendation ranking:
package com.yupi.vectordb.service;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.store.embedding.EmbeddingMatch;
import dev.langchain4j.store.embedding.EmbeddingStore;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.stream.Collectors;
/**
* 编程导航 Weaviate 向量数据库服务
* 提供语义搜索和知识图谱功能
*/
@Service
public class WeaviateVectorService {
@Autowired
private EmbeddingModel embeddingModel;
@Autowired
private EmbeddingStore<TextSegment> weaviateStore;
/**
* 存储代码片段
*/
public String storeCodeSnippet(String code, String language, String category, String author) {
Map<String, String> metadata = new HashMap<>();
metadata.put("language", language);
metadata.put("category", category);
metadata.put("author", author);
metadata.put("timestamp", String.valueOf(System.currentTimeMillis()));
metadata.put("codeId", UUID.randomUUID().toString());
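// TextSegment.from expects a Metadata object, not a plain Map (fully qualified because this file does not import it)
TextSegment segment = TextSegment.from(code, dev.langchain4j.data.document.Metadata.from(metadata));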
Embedding embedding = embeddingModel.embed(segment).content();
weaviateStore.add(embedding, segment);
return metadata.get("codeId");
}
/**
* 语义代码搜索
*/
public List<CodeResult> searchCode(String query, String language, int maxResults) {
Embedding queryEmbedding = embeddingModel.embed(query).content();
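// Use search(EmbeddingSearchRequest), as the Chroma and Pinecone services do
// (fully qualified because this file does not import the request class)
List<EmbeddingMatch<TextSegment>> matches = weaviateStore.search(
dev.langchain4j.store.embedding.EmbeddingSearchRequest.builder()
.queryEmbedding(queryEmbedding)
.maxResults(maxResults * 2)
.build()
).matches();
return matches.stream()
.filter(match -> match.score() > 0.6)
.filter(match -> language == null ||
language.equals(match.embedded().metadata().getString("language")))
.limit(maxResults)
.map(match -> new CodeResult(
match.embedded().text(),
match.embedded().metadata().getString("language"),
match.embedded().metadata().getString("category"),
match.embedded().metadata().getString("author"),
match.score()
))
.collect(Collectors.toList());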
}
/**
* 代码推荐系统
*/
public List<CodeResult> recommendSimilarCode(String referenceCode, int maxResults) {
// 基于参考代码找到相似的代码片段
List<CodeResult> similarCode = searchCode(referenceCode, null, maxResults * 2);
// 根据多个因素进行推荐排序
return similarCode.stream()
.sorted((a, b) -> {
// 综合考虑相似度、作者权威性、时间新旧等因素
double scoreA = a.similarity * 0.7 + getAuthorScore(a.author) * 0.2 + getRecencyScore(a) * 0.1;
double scoreB = b.similarity * 0.7 + getAuthorScore(b.author) * 0.2 + getRecencyScore(b) * 0.1;
return Double.compare(scoreB, scoreA);
})
.limit(maxResults)
.collect(Collectors.toList());
}
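/**
* Combined query: concatenates a natural-language question with a code context
* and searches with the merged text (both inputs are text, not images or audio).
*/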
public List<CodeResult> multiModalSearch(String naturalQuery, String codeContext, int maxResults) {
// 合并自然语言查询和代码上下文
String combinedQuery = naturalQuery;
if (codeContext != null && !codeContext.isEmpty()) {
combinedQuery += " " + codeContext;
}
return searchCode(combinedQuery, null, maxResults);
}
/**
* 代码分类和标签提取
*/
public Map<String, Object> analyzeCode(String code) {
// LinkedHashMap keeps insertion order, so the entries print in the order they are added
Map<String, Object> analysis = new LinkedHashMap<>();
// 基本统计信息
analysis.put("length", code.length());
analysis.put("lines", code.split("\n").length);
// 简单的语言检测
String detectedLanguage = detectLanguage(code);
analysis.put("detectedLanguage", detectedLanguage);
// 复杂度估算
int complexity = estimateComplexity(code);
analysis.put("complexity", complexity);
// 功能分类
String category = categorizeCode(code);
analysis.put("category", category);
return analysis;
}
// 辅助方法
private double getAuthorScore(String author) {
// 根据作者的历史贡献计算权威性分数
Map<String, Double> authorScores = Map.of(
"程序员鱼皮", 0.9,
"编程导航", 0.8,
"面试鸭", 0.7
);
return authorScores.getOrDefault(author, 0.5);
}
private double getRecencyScore(CodeResult result) {
// 根据代码的时间新旧程度计算分数
// 简化实现,实际应该解析 timestamp
return 0.5;
}
private String detectLanguage(String code) {
if (code.contains("public class") || code.contains("import java")) return "java";
if (code.contains("def ") || code.contains("import ")) return "python";
if (code.contains("function") || code.contains("const ")) return "javascript";
return "unknown";
}
private int estimateComplexity(String code) {
int complexity = 1;
complexity += code.split("if ").length - 1;
complexity += code.split("for ").length - 1;
complexity += code.split("while ").length - 1;
return complexity;
}
private String categorizeCode(String code) {
if (code.contains("sort") || code.contains("Sort")) return "sorting";
if (code.contains("search") || code.contains("find")) return "searching";
if (code.contains("Tree") || code.contains("Node")) return "data_structure";
return "general";
}
/**
* 演示 Weaviate 功能
*/
public void demonstrateWeaviateFeatures() {
// 存储示例代码片段
System.out.println("存储代码片段到 Weaviate...");
String javaCode = "public int binarySearch(int[] arr, int target) { int left = 0, right = arr.length - 1; while (left <= right) { int mid = left + (right - left) / 2; if (arr[mid] == target) return mid; if (arr[mid] < target) left = mid + 1; else right = mid - 1; } return -1; }";
String pythonCode = "def binary_search(arr, target): left, right = 0, len(arr) - 1; while left <= right: mid = (left + right) // 2; if arr[mid] == target: return mid; elif arr[mid] < target: left = mid + 1; else: right = mid - 1; return -1";
String jsCode = "function binarySearch(arr, target) { let left = 0, right = arr.length - 1; while (left <= right) { let mid = Math.floor((left + right) / 2); if (arr[mid] === target) return mid; if (arr[mid] < target) left = mid + 1; else right = mid - 1; } return -1; }";
String javaId = storeCodeSnippet(javaCode, "java", "searching", "程序员鱼皮");
String pythonId = storeCodeSnippet(pythonCode, "python", "searching", "编程导航");
String jsId = storeCodeSnippet(jsCode, "javascript", "searching", "面试鸭");
System.out.println("Java 代码 ID: " + javaId);
System.out.println("Python 代码 ID: " + pythonId);
System.out.println("JavaScript 代码 ID: " + jsId);
// 语义搜索演示
System.out.println("\n语义搜索:查找二分查找算法");
List<CodeResult> searchResults = searchCode("二分查找算法", null, 3);
for (int i = 0; i < searchResults.size(); i++) {
CodeResult result = searchResults.get(i);
System.out.printf("%d. [%s] 作者:%s 相似度:%.3f%n",
i + 1, result.language, result.author, result.similarity);
System.out.println(" 代码:" + result.code.substring(0, Math.min(50, result.code.length())) + "...");
}
// 多模态查询演示
System.out.println("\n多模态查询:在数组中查找元素的高效方法");
List<CodeResult> multiModalResults = multiModalSearch(
"在数组中查找元素的高效方法",
"int[] array = {1,2,3,4,5};",
2
);
for (CodeResult result : multiModalResults) {
System.out.println("找到 " + result.language + " 代码,作者:" + result.author);
}
// 代码分析演示
System.out.println("\n代码分析:");
Map<String, Object> analysis = analyzeCode(javaCode);
analysis.forEach((key, value) -> System.out.println(key + ": " + value));
}
// 代码搜索结果类
public static class CodeResult {
public final String code;
public final String language;
public final String category;
public final String author;
public final double similarity;
public CodeResult(String code, String language, String category, String author, double similarity) {
this.code = code;
this.language = language;
this.category = category;
this.author = author;
this.similarity = similarity;
}
}
}
Sample output:
存储代码片段到 Weaviate...
Java 代码 ID: f47ac10b-58cc-4372-a567-0e02b2c3d479
Python 代码 ID: 6ba7b810-9dad-11d1-80b4-00c04fd430c8
JavaScript 代码 ID: 6ba7b811-9dad-11d1-80b4-00c04fd430c8
语义搜索:查找二分查找算法
1. [java] 作者:程序员鱼皮 相似度:0.892
代码:public int binarySearch(int[] arr, int target) { int...
2. [python] 作者:编程导航 相似度:0.876
代码:def binary_search(arr, target): left, right = 0, len...
3. [javascript] 作者:面试鸭 相似度:0.851
代码:function binarySearch(arr, target) { let left = 0, r...
多模态查询:在数组中查找元素的高效方法
找到 java 代码,作者:程序员鱼皮
找到 python 代码,作者:编程导航
代码分析:
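length: 259
lines: 1
detectedLanguage: java
complexity: 4
category: general
Weaviate's graph database features make it well suited to building knowledge graphs and recommendation systems. Combining vector search with graph queries lets us uncover complex relationships in the data and deliver smarter content recommendations and knowledge discovery.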
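In `recommendSimilarCode`, the recency term is a stub that always returns 0.5. One way to fill it in, offered here as an assumption rather than the author's design, is exponential decay over the age of the snippet. The class name, the half-life parameter, and the neutral fallback of 0.5 are all illustrative. Wiring it into the service would also require carrying the `timestamp` metadata into `CodeResult`, which the original class does not do.

```java
import java.time.Duration;

/** Exponential-decay recency score in (0, 1] (illustrative sketch). */
final class RecencyScore {

    private static final double NEUTRAL = 0.5; // fallback when no usable timestamp exists

    private RecencyScore() {}

    /** 1.0 for a brand-new item, 0.5 after one half-life, 0.25 after two, and so on. */
    static double of(long timestampMillis, long nowMillis, Duration halfLife) {
        if (halfLife.isZero() || halfLife.isNegative()) {
            throw new IllegalArgumentException("halfLife must be positive");
        }
        // Timestamps in the future (clock skew) are treated as age zero.
        long ageMillis = Math.max(0L, nowMillis - timestampMillis);
        return Math.pow(0.5, (double) ageMillis / halfLife.toMillis());
    }

    /** Parses the string timestamp stored in metadata; returns NEUTRAL when missing or malformed. */
    static double fromMetadata(String timestamp, long nowMillis, Duration halfLife) {
        if (timestamp == null) {
            return NEUTRAL;
        }
        try {
            return of(Long.parseLong(timestamp.trim()), nowMillis, halfLife);
        } catch (NumberFormatException e) {
            return NEUTRAL;
        }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long ninetyDaysAgo = now - Duration.ofDays(90).toMillis();
        // With a 30-day half-life, a 90-day-old snippet has decayed through three half-lives.
        System.out.println(of(ninetyDaysAgo, now, Duration.ofDays(30)));
    }
}
```

The half-life controls how quickly older snippets lose weight; a value of a few months may suit code that ages slowly, but that is a tuning decision for the application.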
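13.5 Choosing and Optimizing Embedding Models
Choosing the right embedding model is key to a high-quality vector search system. Models differ significantly in semantic understanding, speed, and resource consumption. In 算法导航's learning content recommendation system, the model has to be chosen to fit the specific use case and performance requirements.
Several dimensions matter when choosing a model: its semantic understanding determines how accurate the search results are; its processing speed affects response time; its resource consumption drives deployment cost; and its fit to a specific domain determines how well it performs in vertical scenarios. Update frequency, community support, and licensing terms are also worth considering.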
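Search accuracy can be compared across models with a small labelled test set: for each query, record which documents are relevant, run the search with each candidate model, and compute recall@k. The sketch below is a minimal illustration; the class and method names are hypothetical, and the document IDs are assumed to be plain strings.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Recall@k for one query: the share of relevant documents found in the top k (illustrative). */
final class RetrievalMetrics {

    private RetrievalMetrics() {}

    /**
     * @param retrieved document IDs returned by the search, best first
     * @param relevant  IDs a human judged relevant for the query
     * @param k         how many of the top results to inspect
     * @return fraction of relevant IDs found in the first k results, or 0 when none are labelled
     */
    static double recallAtK(List<String> retrieved, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        Set<String> hits = new HashSet<>(); // a set, so duplicate results are counted once
        int limit = Math.min(k, retrieved.size());
        for (int i = 0; i < limit; i++) {
            String id = retrieved.get(i);
            if (relevant.contains(id)) {
                hits.add(id);
            }
        }
        return (double) hits.size() / relevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = List.of("doc1", "doc7", "doc3"); // hypothetical search result
        Set<String> relevant = Set.of("doc1", "doc3", "doc9");    // hypothetical labels
        System.out.println(recallAtK(retrieved, relevant, 3));
    }
}
```

Averaging recall@k over all test queries gives one number per model, which can then be weighed against latency and memory use.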
Let's build a service that manages and optimizes embedding models:
package com.yupi.vectordb.service;
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallenv15q.BgeSmallEnV15QuantizedEmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
/**
* 老鱼简历嵌入模型管理和优化服务
* 提供多模型管理、性能监控和自动优化功能
*/
@Service
public class EmbeddingModelOptimizer {
// 可用的嵌入模型
private final Map<String, EmbeddingModel> availableModels;
// 模型性能统计
private final Map<String, ModelPerformance> performanceStats;
// 缓存层
private final Map<String, CachedEmbedding> embeddingCache;
// 定时任务调度器
private final ScheduledExecutorService scheduler;
public EmbeddingModelOptimizer() {
this.availableModels = new HashMap<>();
this.performanceStats = new ConcurrentHashMap<>();
this.embeddingCache = new ConcurrentHashMap<>();
this.scheduler = Executors.newScheduledThreadPool(2);
initializeModels();
startPerformanceMonitoring();
}
/**
* 初始化可用的嵌入模型
*/
private void initializeModels() {
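// Lightweight English model (quantized BGE small en v1.5); the map key is only a label and does not name the underlying model
availableModels.put("all-minilm-l6-v2", new BgeSmallEnV15QuantizedEmbeddingModel());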
// 中文优化模型
availableModels.put("bge-small-zh", new BgeSmallZhEmbeddingModel());
// 为每个模型初始化性能统计
for (String modelName : availableModels.keySet()) {
performanceStats.put(modelName, new ModelPerformance(modelName));
}
}
/**
* 智能选择最佳模型
*/
public String selectOptimalModel(String text, String language, String domain) {
// 根据文本特征选择模型
if ("zh".equals(language) || containsChinese(text)) {
return "bge-small-zh"; // 中文内容优先使用中文模型
}
if ("en".equals(language) || text.length() < 100) {
return "all-minilm-l6-v2"; // 英文或短文本使用轻量级模型
}
// 根据历史性能选择
return getBestPerformingModel();
}
/**
* 优化的文本嵌入方法
*/
public Embedding embedWithOptimization(String text, String preferredModel) {
// Resolve the model first so the cache key always includes the model that is actually used
String selectedModel = preferredModel != null ? preferredModel :
selectOptimalModel(text, null, null);
// Check the cache
String cacheKey = generateCacheKey(text, selectedModel);
CachedEmbedding cached = embeddingCache.get(cacheKey);
if (cached != null && !cached.isExpired()) {
return cached.embedding;
}
// 记录性能
long startTime = System.currentTimeMillis();
try {
// 执行嵌入
EmbeddingModel model = availableModels.get(selectedModel);
Embedding embedding = model.embed(text).content();
// 更新性能统计
long duration = System.currentTimeMillis() - startTime;
updatePerformanceStats(selectedModel, duration, true);
// 缓存结果
embeddingCache.put(cacheKey, new CachedEmbedding(embedding, System.currentTimeMillis()));
return embedding;
} catch (Exception e) {
// 更新错误统计
updatePerformanceStats(selectedModel,
System.currentTimeMillis() - startTime, false);
throw e;
}
}
/**
* 批量嵌入优化
*/
public List<Embedding> batchEmbedWithOptimization(List<String> texts, String preferredModel) {
// Embed in input order so that results.get(i) corresponds to texts.get(i);
// every call goes through the cache in embedWithOptimization.
List<Embedding> results = new ArrayList<>(texts.size());
for (String text : texts) {
String selectedModel = preferredModel != null ? preferredModel :
selectOptimalModel(text, null, null);
results.add(embedWithOptimization(text, selectedModel));
}
return results;
}
/**
* 模型性能基准测试
*/
public Map<String, BenchmarkResult> runBenchmark(List<String> testTexts) {
Map<String, BenchmarkResult> results = new HashMap<>();
for (String modelName : availableModels.keySet()) {
System.out.println("测试模型: " + modelName);
long totalTime = 0;
int successCount = 0;
List<Double> similarities = new ArrayList<>();
EmbeddingModel model = availableModels.get(modelName);
Embedding firstEmbedding = null; // embedding of testTexts.get(0), reused for every comparison
for (int i = 0; i < testTexts.size(); i++) {
try {
long startTime = System.currentTimeMillis();
Embedding embedding = model.embed(testTexts.get(i)).content();
long duration = System.currentTimeMillis() - startTime;
totalTime += duration;
successCount++;
// Compare each later text with the first one (a rough sanity check)
if (i == 0) {
firstEmbedding = embedding;
} else if (firstEmbedding != null) {
double similarity = cosineSimilarity(embedding.vector(), firstEmbedding.vector());
similarities.add(similarity);
}
} catch (Exception e) {
System.err.println("模型 " + modelName + " 处理失败: " + e.getMessage());
}
}
double avgTime = successCount > 0 ? (double) totalTime / successCount : 0;
double avgSimilarity = similarities.stream().mapToDouble(Double::doubleValue).average().orElse(0);
results.put(modelName, new BenchmarkResult(
avgTime, successCount, testTexts.size(), avgSimilarity
));
}
return results;
}
/**
* 嵌入质量评估
*/
public double evaluateEmbeddingQuality(String text1, String text2, String expectedSimilarity, String modelName) {
EmbeddingModel model = availableModels.get(modelName);
Embedding embedding1 = model.embed(text1).content();
Embedding embedding2 = model.embed(text2).content();
double actualSimilarity = cosineSimilarity(embedding1.vector(), embedding2.vector());
double expected = "high".equals(expectedSimilarity) ? 0.8 :
"medium".equals(expectedSimilarity) ? 0.5 : 0.2;
// 计算准确性分数
return 1.0 - Math.abs(actualSimilarity - expected);
}
// 辅助方法
private boolean containsChinese(String text) {
return text.chars().anyMatch(ch -> ch >= 0x4e00 && ch <= 0x9fff);
}
private String getBestPerformingModel() {
return performanceStats.entrySet().stream()
.min(Comparator.comparingDouble(entry -> entry.getValue().getAverageTime()))
.map(Map.Entry::getKey)
.orElse("all-minilm-l6-v2");
}
private String generateCacheKey(String text, String model) {
// Key on the full text: distinct strings can share a hashCode(), which would return the wrong vector
return model + ":" + text;
}
private void updatePerformanceStats(String modelName, long duration, boolean success) {
ModelPerformance stats = performanceStats.get(modelName);
if (stats != null) {
stats.recordExecution(duration, success);
}
}
private double cosineSimilarity(float[] vectorA, float[] vectorB) {
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vectorA.length; i++) {
dotProduct += vectorA[i] * vectorB[i];
normA += Math.pow(vectorA[i], 2);
normB += Math.pow(vectorB[i], 2);
}
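if (normA == 0.0 || normB == 0.0) {
return 0.0; // a zero vector has no direction, so avoid dividing by zero
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));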
}
private void startPerformanceMonitoring() {
scheduler.scheduleAtFixedRate(() -> {
// 清理过期缓存
embeddingCache.entrySet().removeIf(entry -> entry.getValue().isExpired());
// 输出性能报告
printPerformanceReport();
}, 5, 5, TimeUnit.MINUTES);
}
private void printPerformanceReport() {
System.out.println("\n=== 嵌入模型性能报告 ===");
performanceStats.forEach((name, stats) -> {
System.out.printf("%s: 平均时间=%.2fms, 成功率=%.1f%%, 总调用=%d%n",
name, stats.getAverageTime(), stats.getSuccessRate() * 100, stats.getTotalCalls());
});
System.out.println("缓存大小: " + embeddingCache.size());
}
/**
* 演示嵌入模型优化功能
*/
public void demonstrateOptimization() {
List<String> testTexts = Arrays.asList(
"Java 是一种强类型的面向对象编程语言",
"Python 具有简洁易读的语法特性",
"算法导航提供可视化的算法学习体验",
"面试鸭帮助程序员准备技术面试",
"Machine learning algorithms for natural language processing"
);
System.out.println("演示智能模型选择:");
for (String text : testTexts) {
String selectedModel = selectOptimalModel(text, null, null);
System.out.println("文本: " + text.substring(0, Math.min(20, text.length())) + "...");
System.out.println("选择模型: " + selectedModel);
// 执行嵌入
Embedding embedding = embedWithOptimization(text, null);
System.out.println("向量维度: " + embedding.dimension());
System.out.println();
}
// 运行基准测试
System.out.println("运行基准测试...");
Map<String, BenchmarkResult> benchmarkResults = runBenchmark(testTexts);
System.out.println("\n=== 基准测试结果 ===");
benchmarkResults.forEach((modelName, result) -> {
System.out.printf("%s: 平均时间=%.2fms, 成功率=%.1f%%, 平均相似度=%.3f%n",
modelName, result.averageTime, result.successRate * 100, result.averageSimilarity);
});
// 质量评估演示
System.out.println("\n质量评估演示:");
double quality1 = evaluateEmbeddingQuality(
"Java 编程语言", "Java 开发", "high", "all-minilm-l6-v2");
double quality2 = evaluateEmbeddingQuality(
"算法学习", "数据结构", "medium", "bge-small-zh");
System.out.printf("相似文本质量分数: %.3f%n", quality1);
System.out.printf("中等相关文本质量分数: %.3f%n", quality2);
}
// 内部类定义
private static class ModelPerformance {
private final String modelName;
private long totalTime;
private int totalCalls;
private int successCalls;
public ModelPerformance(String modelName) {
this.modelName = modelName;
}
public synchronized void recordExecution(long duration, boolean success) {
totalTime += duration;
totalCalls++;
if (success) successCalls++;
}
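// The getters are synchronized as well, so they see a consistent view of the counters
public synchronized double getAverageTime() {
return totalCalls > 0 ? (double) totalTime / totalCalls : 0;
}
public synchronized double getSuccessRate() {
return totalCalls > 0 ? (double) successCalls / totalCalls : 0;
}
public synchronized int getTotalCalls() {
return totalCalls;
}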
}
private static class CachedEmbedding {
final Embedding embedding;
final long timestamp;
static final long TTL = 30 * 60 * 1000L; // entries expire after 30 minutes
public CachedEmbedding(Embedding embedding, long timestamp) {
this.embedding = embedding;
this.timestamp = timestamp;
}
public boolean isExpired() {
return System.currentTimeMillis() - timestamp > TTL;
}
}
public static class BenchmarkResult {
final double averageTime;
final int successCount;
final int totalCount;
final double averageSimilarity;
final double successRate;
public BenchmarkResult(double averageTime, int successCount, int totalCount, double averageSimilarity) {
this.averageTime = averageTime;
this.successCount = successCount;
this.totalCount = totalCount;
this.averageSimilarity = averageSimilarity;
this.successRate = totalCount > 0 ? (double) successCount / totalCount : 0.0;
}
}
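}
Sample output (timings and similarity scores are illustrative and depend on the machine; the performance report at the end is printed by the scheduled task, so it first appears five minutes after start-up):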
演示智能模型选择:
文本: Java 是一种强类型的面向对象编程语言...
选择模型: bge-small-zh
向量维度: 512
文本: Python 具有简洁易读的语法特性...
选择模型: bge-small-zh
向量维度: 512
文本: 算法导航提供可视化的算法学习体验...
选择模型: bge-small-zh
向量维度: 512
文本: 面试鸭帮助程序员准备技术面试...
选择模型: bge-small-zh
向量维度: 512
文本: Machine learning alg...
选择模型: all-minilm-l6-v2
向量维度: 384
运行基准测试...
测试模型: all-minilm-l6-v2
测试模型: bge-small-zh
=== 基准测试结果 ===
all-minilm-l6-v2: 平均时间=245.60ms, 成功率=100.0%, 平均相似度=0.456
bge-small-zh: 平均时间=312.40ms, 成功率=100.0%, 平均相似度=0.523
质量评估演示:
相似文本质量分数: 0.876
中等相关文本质量分数: 0.734
=== 嵌入模型性能报告 ===
all-minilm-l6-v2: 平均时间=245.60ms, 成功率=100.0%, 总调用=6
bge-small-zh: 平均时间=312.40ms, 成功率=100.0%, 总调用=9
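缓存大小: 15
Optimizing an embedding model is a continuous process that has to be adjusted against the actual application scenario and user feedback. Combining performance monitoring, quality evaluation, and per-text model selection lets us build an efficient, accurate vector search system that gives users a good semantic search experience.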
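One detail that deserves care in an embedding cache like the one above is the cache key. The following is a minimal sketch, not the optimizer's actual implementation: `TtlEmbeddingCache`, `key`, and `getOrCompute` are hypothetical names, the embedder is passed in as a plain function instead of a LangChain4j model, and the clock is passed in as a parameter so that expiry is easy to reason about. It keys entries on the model name plus the full text, because two different strings can share the same `hashCode()`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical TTL cache for embedding vectors; not part of LangChain4j. */
final class TtlEmbeddingCache {

    /** A cached vector plus the time at which it was stored. */
    private record Entry(float[] vector, long storedAtMillis) {}

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    TtlEmbeddingCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Builds the key from the model name and the full text, not from text.hashCode(). */
    static String key(String modelName, String text) {
        return modelName + ":" + text;
    }

    /**
     * Returns the cached vector if it is still fresh; otherwise calls the embedder
     * and stores the result under the current time.
     */
    float[] getOrCompute(String modelName, String text, long nowMillis,
                         Function<String, float[]> embedder) {
        String cacheKey = key(modelName, text);
        Entry cached = entries.get(cacheKey);
        if (cached != null && nowMillis - cached.storedAtMillis() <= ttlMillis) {
            return cached.vector();
        }
        float[] vector = embedder.apply(text);
        entries.put(cacheKey, new Entry(vector, nowMillis));
        return vector;
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are different strings with the same hashCode().
        System.out.println("Aa".hashCode() == "BB".hashCode()); // prints true
        // Keys built from the full text stay distinct.
        System.out.println(key("bge-small-zh", "Aa").equals(key("bge-small-zh", "BB"))); // prints false
    }
}
```

The trade-off is memory: full-text keys are larger than hash-based ones, so a production cache would probably also bound the number of entries.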
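Exercises
Exercise 1
Implement a vector similarity calculator that supports several measures (cosine similarity, Euclidean distance, Manhattan distance), and compare how they behave in different scenarios.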
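Before the full solution, a small worked example may help show why the choice of measure matters. The sketch below is only an illustration (`MetricComparisonSketch` is a hypothetical class, and it assumes non-zero vectors of equal length): it compares a vector with a copy of itself scaled by two. Cosine similarity ignores the scaling, while both distance measures report a gap.

```java
import java.util.Locale;

/** Hypothetical helper used only for this illustration; assumes non-zero, equal-length vectors. */
final class MetricComparisonSketch {

    /** Cosine of the angle between a and b: depends on direction only. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Straight-line distance: sensitive to vector length as well as direction. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Sum of absolute per-component differences. */
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {2, 4, 6}; // same direction as a, twice the length
        System.out.printf(Locale.ROOT, "cosine=%.4f euclidean=%.4f manhattan=%.4f%n",
                cosine(a, b), euclidean(a, b), manhattan(a, b));
        // prints cosine=1.0000 euclidean=3.7417 manhattan=6.0000
    }
}
```

This is one reason cosine similarity is a common default for text embeddings: it compares direction and is unaffected by vector magnitude.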
package com.yupi.vectordb.util;
import java.util.Arrays;
import java.util.List;
/**
* 编程导航向量相似度计算器
* 支持多种相似度计算方法和性能对比
*/
public class VectorSimilarityCalculator {
/**
* 余弦相似度计算
* 计算两个向量之间的夹角余弦值
*/
public static double cosineSimilarity(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("向量维度必须相同");
}
double dotProduct = 0.0;
double normA = 0.0;
double normB = 0.0;
for (int i = 0; i < vectorA.length; i++) {
dotProduct += vectorA[i] * vectorB[i];
normA += vectorA[i] * vectorA[i];
normB += vectorB[i] * vectorB[i];
}
if (normA == 0.0 || normB == 0.0) {
return 0.0;
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
/**
* 欧几里得距离计算
* 计算两个向量在空间中的直线距离
*/
public static double euclideanDistance(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("向量维度必须相同");
}
double sumSquaredDiff = 0.0;
for (int i = 0; i < vectorA.length; i++) {
double diff = vectorA[i] - vectorB[i];
sumSquaredDiff += diff * diff;
}
return Math.sqrt(sumSquaredDiff);
}
/**
* 曼哈顿距离计算
* 计算两个向量各维度差值的绝对值之和
*/
public static double manhattanDistance(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("向量维度必须相同");
}
double sumAbsDiff = 0.0;
for (int i = 0; i < vectorA.length; i++) {
sumAbsDiff += Math.abs(vectorA[i] - vectorB[i]);
}
return sumAbsDiff;
}
/**
* 点积计算
*/
public static double dotProduct(float[] vectorA, float[] vectorB) {
if (vectorA.length != vectorB.length) {
throw new IllegalArgumentException("向量维度必须相同");
}
double product = 0.0;
for (int i = 0; i < vectorA.length; i++) {
product += vectorA[i] * vectorB[i];
}
return product;
}
/**
* 综合相似度评估
* 结合多种方法给出综合评分
*/
public static SimilarityResult comprehensiveAnalysis(float[] vectorA, float[] vectorB) {
double cosine = cosineSimilarity(vectorA, vectorB);
double euclidean = euclideanDistance(vectorA, vectorB);
double manhattan = manhattanDistance(vectorA, vectorB);
double dot = dotProduct(vectorA, vectorB);
// Normalize the distances into similarities, assuming every component lies in [-1, 1]
double maxEuclidean = 2.0 * Math.sqrt(vectorA.length); // each component differs by at most 2
double euclideanSimilarity = 1.0 - (euclidean / maxEuclidean);
double maxManhattan = vectorA.length * 2; // each component differs by at most 2
double manhattanSimilarity = 1.0 - (manhattan / maxManhattan);
// Weighted score (tune the weights per use case); cosine lies in [-1, 1], the other two terms in [0, 1]
double comprehensiveScore = 0.4 * cosine + 0.3 * euclideanSimilarity + 0.3 * manhattanSimilarity;
return new SimilarityResult(cosine, euclidean, manhattan, dot, comprehensiveScore);
}
/**
* 批量相似度计算
* 找出与查询向量最相似的Top-K个向量
*/
public static List<SimilarityMatch> findTopSimilar(float[] queryVector,
List<float[]> candidateVectors,
int topK,
SimilarityMethod method) {
return candidateVectors.stream()
.map(candidate -> {
double similarity = calculateSimilarity(queryVector, candidate, method);
return new SimilarityMatch(candidate, similarity);
})
.sorted((a, b) -> Double.compare(b.similarity, a.similarity))
.limit(topK)
.toList();
}
/**
* 根据指定方法计算相似度
*/
private static double calculateSimilarity(float[] vectorA, float[] vectorB, SimilarityMethod method) {
return switch (method) {
case COSINE -> cosineSimilarity(vectorA, vectorB);
case EUCLIDEAN -> 1.0 / (1.0 + euclideanDistance(vectorA, vectorB)); // 转换为相似度
case MANHATTAN -> 1.0 / (1.0 + manhattanDistance(vectorA, vectorB)); // 转换为相似度
case DOT_PRODUCT -> dotProduct(vectorA, vectorB);
};
}
/**
* 性能基准测试
*/
public static void benchmarkMethods(List<float[]> testVectors) {
if (testVectors.size() < 2) return;
float[] baseVector = testVectors.get(0);
int iterations = 1000;
System.out.println("=== 相似度计算方法性能测试 ===");
for (SimilarityMethod method : SimilarityMethod.values()) {
long startTime = System.nanoTime();
for (int i = 0; i < iterations; i++) {
for (int j = 1; j < testVectors.size(); j++) {
calculateSimilarity(baseVector, testVectors.get(j), method);
}
}
long endTime = System.nanoTime();
double avgTime = (endTime - startTime) / (double)(iterations * (testVectors.size() - 1)) / 1_000_000;
System.out.printf("%s: 平均耗时 %.4f ms%n", method, avgTime);
}
}
/**
* 演示不同相似度方法的效果
*/
public static void demonstrateSimilarityMethods() {
// 创建测试向量
float[] vector1 = {1.0f, 0.0f, 0.0f, 1.0f}; // 基准向量
float[] vector2 = {0.9f, 0.1f, 0.0f, 0.9f}; // 相似向量
float[] vector3 = {0.0f, 1.0f, 1.0f, 0.0f}; // 不同向量
float[] vector4 = {-1.0f, 0.0f, 0.0f, -1.0f}; // 相反向量
System.out.println("=== 向量相似度计算演示 ===");
System.out.println("基准向量: " + Arrays.toString(vector1));
float[][] testVectors = {vector2, vector3, vector4};
String[] descriptions = {"相似向量", "不同向量", "相反向量"};
for (int i = 0; i < testVectors.length; i++) {
System.out.println("\n" + descriptions[i] + ": " + Arrays.toString(testVectors[i]));
SimilarityResult result = comprehensiveAnalysis(vector1, testVectors[i]);
System.out.printf("余弦相似度: %.4f%n", result.cosineSimilarity);
System.out.printf("欧几里得距离: %.4f%n", result.euclideanDistance);
System.out.printf("曼哈顿距离: %.4f%n", result.manhattanDistance);
System.out.printf("点积: %.4f%n", result.dotProduct);
System.out.printf("综合评分: %.4f%n", result.comprehensiveScore);
}
// 批量相似度测试
List<float[]> candidates = Arrays.asList(vector2, vector3, vector4);
System.out.println("\n=== Top-2 最相似向量 (余弦相似度) ===");
List<SimilarityMatch> topSimilar = findTopSimilar(vector1, candidates, 2, SimilarityMethod.COSINE);
for (int i = 0; i < topSimilar.size(); i++) {
SimilarityMatch match = topSimilar.get(i);
System.out.printf("%d. %s, 相似度: %.4f%n",
i + 1, Arrays.toString(match.vector), match.similarity);
}
// 性能测试
List<float[]> perfTestVectors = Arrays.asList(vector1, vector2, vector3, vector4);
benchmarkMethods(perfTestVectors);
}
// 枚举和数据类
public enum SimilarityMethod {
COSINE, EUCLIDEAN, MANHATTAN, DOT_PRODUCT
}
public static class SimilarityResult {
public final double cosineSimilarity;
public final double euclideanDistance;
public final double manhattanDistance;
public final double dotProduct;
public final double comprehensiveScore;
public SimilarityResult(double cosine, double euclidean, double manhattan,
double dot, double comprehensive) {
this.cosineSimilarity = cosine;
this.euclideanDistance = euclidean;
this.manhattanDistance = manhattan;
this.dotProduct = dot;
this.comprehensiveScore = comprehensive;
}
}
public static class SimilarityMatch {
public final float[] vector;
public final double similarity;
public SimilarityMatch(float[] vector, double similarity) {
this.vector = vector;
this.similarity = similarity;
}
}
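}
Exercise 2
Design a performance monitoring system for a vector database that tracks key metrics in real time (query response time, cache hit rate, vector store size) and provides alerting.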
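A monitor like this is usually updated from many request threads at once, so the counters themselves need to be thread-safe. Marking a `long` field `volatile` guarantees visibility, but `field++` is still a separate read and write and can lose updates under contention. The following is a minimal sketch of a hit-rate counter built on `java.util.concurrent.atomic.LongAdder`; the class and method names are hypothetical and chosen only for this illustration.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical hit/miss counter; the names are illustrative only. */
final class CacheHitRateSketch {

    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    /** Records one lookup; safe to call from any number of threads. */
    void record(boolean cacheHit) {
        if (cacheHit) {
            hits.increment();
        } else {
            misses.increment();
        }
    }

    long total() {
        return hits.sum() + misses.sum();
    }

    /** Hit rate in [0, 1]; 0 when nothing has been recorded yet. */
    double hitRate() {
        long hitCount = hits.sum();
        long all = hitCount + misses.sum();
        return all == 0 ? 0.0 : (double) hitCount / all;
    }

    /** Records perThread hits from each of several threads and waits for all of them. */
    static CacheHitRateSketch recordConcurrently(int threads, int perThread) {
        CacheHitRateSketch counter = new CacheHitRateSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.record(true);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(recordConcurrently(4, 10_000).total()); // prints 40000
    }
}
```

`LongAdder` suits write-heavy counters that are read only occasionally; `AtomicLong` is a reasonable alternative when a value must be read and updated in one step.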
package com.yupi.vectordb.monitor;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
/**
* 面试鸭向量数据库性能监控系统
* 提供实时性能监控、指标统计和告警功能
*/
public class VectorDBPerformanceMonitor {
private final ScheduledExecutorService scheduler;
private final Map<String, AtomicLong> counters;
private final List<QueryRecord> queryHistory;
private final Map<String, AlertRule> alertRules;
private final Queue<AlertEvent> activeAlerts;
// 性能指标
private volatile long totalQueries = 0;
private volatile long cacheHits = 0;
private volatile long cacheMisses = 0;
private volatile long totalResponseTime = 0;
private volatile long vectorStoreSize = 0;
private volatile double memoryUsage = 0.0;
// 监控配置
private final int maxHistorySize = 10000;
private final int maxActiveAlerts = 100;
public VectorDBPerformanceMonitor() {
this.scheduler = Executors.newScheduledThreadPool(2);
this.counters = new ConcurrentHashMap<>();
this.queryHistory = Collections.synchronizedList(new ArrayList<>());
this.alertRules = new ConcurrentHashMap<>();
this.activeAlerts = new LinkedList<>();
initializeDefaultMetrics();
startMonitoring();
}
/**
* 初始化默认监控指标
*/
private void initializeDefaultMetrics() {
counters.put("query_count", new AtomicLong(0));
counters.put("error_count", new AtomicLong(0));
counters.put("cache_hits", new AtomicLong(0));
counters.put("cache_misses", new AtomicLong(0));
counters.put("embeddings_generated", new AtomicLong(0));
// 设置默认告警规则
addAlertRule("high_response_time", new AlertRule(
"平均响应时间过高", 1000.0, AlertRule.Operator.GREATER_THAN, 5));
addAlertRule("low_cache_hit_rate", new AlertRule(
"缓存命中率过低", 0.5, AlertRule.Operator.LESS_THAN, 3));
addAlertRule("high_error_rate", new AlertRule(
"错误率过高", 0.05, AlertRule.Operator.GREATER_THAN, 2));
}
/**
* 记录查询性能
*/
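// synchronized: the ++ and += updates on the volatile fields below are not atomic on their own
public synchronized void recordQuery(String operation, long responseTimeMs, boolean success, boolean cacheHit) {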
QueryRecord record = new QueryRecord(
System.currentTimeMillis(), operation, responseTimeMs, success, cacheHit);
// 更新统计数据
counters.get("query_count").incrementAndGet();
totalQueries++;
totalResponseTime += responseTimeMs;
if (cacheHit) {
counters.get("cache_hits").incrementAndGet();
cacheHits++;
} else {
counters.get("cache_misses").incrementAndGet();
cacheMisses++;
}
if (!success) {
counters.get("error_count").incrementAndGet();
}
// 添加到历史记录
synchronized (queryHistory) {
queryHistory.add(record);
if (queryHistory.size() > maxHistorySize) {
queryHistory.remove(0);
}
}
// 检查告警条件
checkAlertConditions();
}
/**
* 记录向量生成
*/
public void recordEmbeddingGeneration(int vectorCount, long processingTimeMs) {
counters.get("embeddings_generated").addAndGet(vectorCount);
QueryRecord record = new QueryRecord(
System.currentTimeMillis(), "embedding_generation", processingTimeMs, true, false);
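synchronized (queryHistory) {
queryHistory.add(record);
// Keep the history bounded, as recordQuery does
if (queryHistory.size() > maxHistorySize) {
queryHistory.remove(0);
}
}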
}
/**
* 更新向量存储大小
*/
public void updateVectorStoreSize(long newSize) {
this.vectorStoreSize = newSize;
}
/**
* 更新内存使用情况
*/
public void updateMemoryUsage(double memoryUsagePercent) {
this.memoryUsage = memoryUsagePercent;
}
/**
* 获取实时性能指标
*/
public PerformanceMetrics getCurrentMetrics() {
long totalQueries = this.totalQueries;
double avgResponseTime = totalQueries > 0 ? (double) totalResponseTime / totalQueries : 0;
double cacheHitRate = (cacheHits + cacheMisses) > 0 ?
(double) cacheHits / (cacheHits + cacheMisses) : 0;
double errorRate = totalQueries > 0 ?
(double) counters.get("error_count").get() / totalQueries : 0;
return new PerformanceMetrics(
totalQueries, avgResponseTime, cacheHitRate, errorRate,
vectorStoreSize, memoryUsage, counters.get("embeddings_generated").get()
);
}
/**
* 获取指定时间窗口的性能统计
*/
public PerformanceMetrics getMetricsInTimeWindow(long windowMinutes) {
long cutoffTime = System.currentTimeMillis() - (windowMinutes * 60 * 1000);
List<QueryRecord> windowRecords;
synchronized (queryHistory) {
windowRecords = queryHistory.stream()
.filter(record -> record.timestamp >= cutoffTime)
.toList();
}
if (windowRecords.isEmpty()) {
return new PerformanceMetrics(0, 0, 0, 0, vectorStoreSize, memoryUsage, 0);
}
long totalQueries = windowRecords.size();
double avgResponseTime = windowRecords.stream()
.mapToLong(r -> r.responseTimeMs)
.average().orElse(0);
long cacheHits = windowRecords.stream()
.mapToLong(r -> r.cacheHit ? 1 : 0).sum();
double cacheHitRate = (double) cacheHits / totalQueries;
long errors = windowRecords.stream()
.mapToLong(r -> r.success ? 0 : 1).sum();
double errorRate = (double) errors / totalQueries;
return new PerformanceMetrics(
totalQueries, avgResponseTime, cacheHitRate, errorRate,
vectorStoreSize, memoryUsage, 0
);
}
/**
* 添加告警规则
*/
public void addAlertRule(String ruleName, AlertRule rule) {
alertRules.put(ruleName, rule);
}
/**
* 检查告警条件
*/
private void checkAlertConditions() {
PerformanceMetrics current = getCurrentMetrics();
alertRules.forEach((name, rule) -> {
double value = getMetricValue(current, rule.metricName);
boolean triggered = rule.evaluate(value);
if (triggered) {
AlertEvent alert = new AlertEvent(
name, rule.description, value, System.currentTimeMillis());
synchronized (activeAlerts) {
activeAlerts.offer(alert);
if (activeAlerts.size() > maxActiveAlerts) {
activeAlerts.poll();
}
}
System.err.printf("[ALERT] %s: %s (值: %.4f)%n",
name, rule.description, value);
}
});
}
/**
* 根据指标名称获取指标值
*/
private double getMetricValue(PerformanceMetrics metrics, String metricName) {
return switch (metricName) {
case "平均响应时间过高" -> metrics.averageResponseTime;
case "缓存命中率过低" -> metrics.cacheHitRate;
case "错误率过高" -> metrics.errorRate;
case "内存使用率" -> metrics.memoryUsage;
default -> 0.0;
};
}
/**
* 启动监控任务
*/
private void startMonitoring() {
// 定期输出性能报告
scheduler.scheduleAtFixedRate(this::printPerformanceReport, 1, 5, TimeUnit.MINUTES);
// 定期清理过期告警
scheduler.scheduleAtFixedRate(this::cleanupExpiredAlerts, 1, 1, TimeUnit.HOURS);
}
/**
* 打印性能报告
*/
public void printPerformanceReport() {
PerformanceMetrics current = getCurrentMetrics();
PerformanceMetrics last5Min = getMetricsInTimeWindow(5);
String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
System.out.println("\n" + "=".repeat(50));
System.out.println("代码小抄向量数据库性能报告 - " + timestamp);
System.out.println("=".repeat(50));
System.out.println("【总体统计】");
System.out.printf(" 总查询数: %d%n", current.totalQueries);
System.out.printf(" 平均响应时间: %.2f ms%n", current.averageResponseTime);
System.out.printf(" 缓存命中率: %.2f%%%n", current.cacheHitRate * 100);
System.out.printf(" 错误率: %.2f%%%n", current.errorRate * 100);
System.out.printf(" 向量存储大小: %d%n", current.vectorStoreSize);
System.out.printf(" 内存使用率: %.2f%%%n", current.memoryUsage);
System.out.printf(" 生成向量数: %d%n", current.embeddingsGenerated);
System.out.println("\n【最近5分钟】");
System.out.printf(" 查询数: %d%n", last5Min.totalQueries);
System.out.printf(" 平均响应时间: %.2f ms%n", last5Min.averageResponseTime);
System.out.printf(" 缓存命中率: %.2f%%%n", last5Min.cacheHitRate * 100);
System.out.printf(" 错误率: %.2f%%%n", last5Min.errorRate * 100);
// 显示活跃告警
synchronized (activeAlerts) {
if (!activeAlerts.isEmpty()) {
System.out.println("\n【活跃告警】");
activeAlerts.forEach(alert ->
System.out.printf(" - %s: %s (时间: %s)%n",
alert.ruleName, alert.description,
LocalDateTime.ofInstant(
java.time.Instant.ofEpochMilli(alert.timestamp),
java.time.ZoneId.systemDefault()
).format(DateTimeFormatter.ofPattern("HH:mm:ss"))
)
);
}
}
System.out.println("=".repeat(50));
}
/**
* 清理过期告警
*/
private void cleanupExpiredAlerts() {
long expireTime = System.currentTimeMillis() - (24 * 60 * 60 * 1000); // 24小时过期
synchronized (activeAlerts) {
activeAlerts.removeIf(alert -> alert.timestamp < expireTime);
}
}
/**
* 获取系统健康状态
*/
public SystemHealth getSystemHealth() {
PerformanceMetrics metrics = getCurrentMetrics();
// 根据关键指标评估系统健康状态
boolean healthy = true;
List<String> issues = new ArrayList<>();
if (metrics.averageResponseTime > 2000) {
healthy = false;
issues.add("响应时间过长");
}
if (metrics.cacheHitRate < 0.3) {
healthy = false;
issues.add("缓存命中率过低");
}
if (metrics.errorRate > 0.1) {
healthy = false;
issues.add("错误率过高");
}
if (metrics.memoryUsage > 90) {
healthy = false;
issues.add("内存使用率过高");
}
return new SystemHealth(healthy, issues, metrics);
}
/**
* 演示监控系统功能
*/
public void demonstrateMonitoring() {
System.out.println("=== 剪切助手向量数据库监控演示 ===");
// 模拟查询操作
System.out.println("模拟查询操作...");
for (int i = 0; i < 50; i++) {
long responseTime = 100 + (long)(Math.random() * 500); // 100-600ms
boolean success = Math.random() > 0.05; // 95%成功率
boolean cacheHit = Math.random() > 0.4; // 60%缓存命中率
recordQuery("semantic_search", responseTime, success, cacheHit);
if (i % 10 == 0) {
recordEmbeddingGeneration(5, 200 + (long)(Math.random() * 300));
}
}
// 更新存储信息
updateVectorStoreSize(1000000);
updateMemoryUsage(65.5);
// 输出当前指标
PerformanceMetrics current = getCurrentMetrics();
System.out.println("\n当前性能指标:");
System.out.printf("总查询数: %d%n", current.totalQueries);
System.out.printf("平均响应时间: %.2f ms%n", current.averageResponseTime);
System.out.printf("缓存命中率: %.2f%%%n", current.cacheHitRate * 100);
System.out.printf("错误率: %.2f%%%n", current.errorRate * 100);
// 检查系统健康状态
SystemHealth health = getSystemHealth();
System.out.println("\n系统健康状态: " + (health.healthy ? "健康" : "异常"));
if (!health.issues.isEmpty()) {
System.out.println("发现问题: " + String.join(", ", health.issues));
}
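// Simulate slow queries. The alert rule compares the cumulative average, which after the 50 faster
// queries above stays below the 1000 ms threshold, so these five calls raise the average
// without triggering high_response_time.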
System.out.println("\n模拟高响应时间场景...");
for (int i = 0; i < 5; i++) {
recordQuery("slow_query", 1500 + (long)(Math.random() * 1000), true, false);
}
}
// 内部类定义
public static class QueryRecord {
public final long timestamp;
public final String operation;
public final long responseTimeMs;
public final boolean success;
public final boolean cacheHit;
public QueryRecord(long timestamp, String operation, long responseTimeMs,
boolean success, boolean cacheHit) {
this.timestamp = timestamp;
this.operation = operation;
this.responseTimeMs = responseTimeMs;
this.success = success;
this.cacheHit = cacheHit;
}
}
public static class PerformanceMetrics {
public final long totalQueries;
public final double averageResponseTime;
public final double cacheHitRate;
public final double errorRate;
public final long vectorStoreSize;
public final double memoryUsage;
public final long embeddingsGenerated;
public PerformanceMetrics(long totalQueries, double averageResponseTime,
double cacheHitRate, double errorRate,
long vectorStoreSize, double memoryUsage,
long embeddingsGenerated) {
this.totalQueries = totalQueries;
this.averageResponseTime = averageResponseTime;
this.cacheHitRate = cacheHitRate;
this.errorRate = errorRate;
this.vectorStoreSize = vectorStoreSize;
this.memoryUsage = memoryUsage;
this.embeddingsGenerated = embeddingsGenerated;
}
}
public static class AlertRule {
public final String description;
public final double threshold;
public final Operator operator;
public final int priority;
public final String metricName;
public AlertRule(String description, double threshold, Operator operator, int priority) {
this.description = description;
this.threshold = threshold;
this.operator = operator;
this.priority = priority;
this.metricName = description; // 简化实现
}
public boolean evaluate(double value) {
return switch (operator) {
case GREATER_THAN -> value > threshold;
case LESS_THAN -> value < threshold;
case EQUALS -> Math.abs(value - threshold) < 0.001;
};
}
public enum Operator {
GREATER_THAN, LESS_THAN, EQUALS
}
}
public static class AlertEvent {
public final String ruleName;
public final String description;
public final double value;
public final long timestamp;
public AlertEvent(String ruleName, String description, double value, long timestamp) {
this.ruleName = ruleName;
this.description = description;
this.value = value;
this.timestamp = timestamp;
}
}
public static class SystemHealth {
public final boolean healthy;
public final List<String> issues;
public final PerformanceMetrics metrics;
public SystemHealth(boolean healthy, List<String> issues, PerformanceMetrics metrics) {
this.healthy = healthy;
this.issues = issues;
this.metrics = metrics;
}
}
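}
Exercise 3
Build an intelligent document chunking system that automatically selects the best chunking strategy based on document type, content characteristics, and intended use, and that supports advanced features such as overlapping chunks and semantic chunking.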
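The simplest of these strategies, overlapping (sliding-window) chunking, can be sketched without any library. The example below is a minimal illustration under stated assumptions: `SlidingWindowSketch` and `chunk` are hypothetical names, sizes are counted in characters, and no attempt is made to cut at word or sentence boundaries. Each window starts `chunkSize - overlap` characters after the previous one, so the overlap must be smaller than the chunk size or the window would never advance.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical character-based sliding-window chunker, for illustration only. */
final class SlidingWindowSketch {

    /**
     * Splits text into windows of at most chunkSize characters, where each window
     * repeats the last `overlap` characters of the previous one.
     */
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap; // always positive thanks to the check above
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last window reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(chunk("abcdefghij", 4, 1)); // prints [abcd, defg, ghij]
    }
}
```

In the output, `d` and `g` each appear in two neighbouring chunks; that shared context is what lets a retrieval system match a query that straddles a chunk boundary.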
package com.yupi.vectordb.chunking;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.embedding.EmbeddingModel;
import dev.langchain4j.model.embedding.onnx.bgesmallzh.BgeSmallZhEmbeddingModel;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
/**
* 算法导航智能文档分块系统
* 提供多种分块策略和自动优化功能
*/
public class IntelligentDocumentChunker {
private final EmbeddingModel embeddingModel;
private final Map<DocumentType, ChunkingStrategy> defaultStrategies;
private final List<ChunkingRule> customRules;
// 分块配置
private static final int DEFAULT_CHUNK_SIZE = 512;
private static final int DEFAULT_OVERLAP = 50;
private static final double SEMANTIC_SIMILARITY_THRESHOLD = 0.8;
public IntelligentDocumentChunker() {
this.embeddingModel = new BgeSmallZhEmbeddingModel();
this.defaultStrategies = initializeDefaultStrategies();
this.customRules = new ArrayList<>();
}
/**
* 初始化默认分块策略
*/
private Map<DocumentType, ChunkingStrategy> initializeDefaultStrategies() {
Map<DocumentType, ChunkingStrategy> strategies = new HashMap<>();
strategies.put(DocumentType.CODE, new ChunkingStrategy(
ChunkingMethod.FUNCTION_BASED, 200, 20, "按函数和类分块"));
strategies.put(DocumentType.TECHNICAL_DOC, new ChunkingStrategy(
ChunkingMethod.SEMANTIC, 512, 50, "基于语义相关性分块"));
strategies.put(DocumentType.TUTORIAL, new ChunkingStrategy(
ChunkingMethod.HIERARCHICAL, 400, 40, "按章节层次分块"));
strategies.put(DocumentType.FAQ, new ChunkingStrategy(
ChunkingMethod.QA_PAIR, 300, 0, "按问答对分块"));
strategies.put(DocumentType.GENERAL, new ChunkingStrategy(
ChunkingMethod.SLIDING_WINDOW, 512, 50, "滑动窗口分块"));
return strategies;
}
/**
* 智能文档分块主方法
*/
public List<TextSegment> chunkDocument(Document document, ChunkingOptions options) {
// 分析文档特征
DocumentAnalysis analysis = analyzeDocument(document);
// 选择最优分块策略
ChunkingStrategy strategy = selectOptimalStrategy(analysis, options);
// 执行分块
List<TextSegment> chunks = executeChunking(document, strategy, analysis);
// 后处理优化
chunks = postProcessChunks(chunks, options);
return chunks;
}
/**
* 分析文档特征
*/
private DocumentAnalysis analyzeDocument(Document document) {
String text = document.text();
// 基本统计信息
int totalLength = text.length();
int lineCount = text.split("\n").length;
int paragraphCount = text.split("\n\\s*\n").length;
// 检测文档类型
DocumentType type = detectDocumentType(text);
// 检测语言
String language = detectLanguage(text);
// 分析结构特征
StructureFeatures structure = analyzeStructure(text);
// 计算内容复杂度
double complexity = calculateComplexity(text);
return new DocumentAnalysis(type, language, totalLength, lineCount,
paragraphCount, structure, complexity);
}
/**
* 检测文档类型
*/
private DocumentType detectDocumentType(String text) {
// 代码文档检测
if (containsCodePatterns(text)) {
return DocumentType.CODE;
}
// FAQ检测
if (containsQAPatterns(text)) {
return DocumentType.FAQ;
}
// 教程检测
if (containsTutorialPatterns(text)) {
return DocumentType.TUTORIAL;
}
// 技术文档检测
if (containsTechnicalPatterns(text)) {
return DocumentType.TECHNICAL_DOC;
}
return DocumentType.GENERAL;
}
/**
* 选择最优分块策略
*/
private ChunkingStrategy selectOptimalStrategy(DocumentAnalysis analysis, ChunkingOptions options) {
// 首先检查用户自定义策略
if (options.preferredStrategy != null) {
return options.preferredStrategy;
}
// 根据文档类型选择默认策略
ChunkingStrategy baseStrategy = defaultStrategies.get(analysis.documentType);
// 根据文档特征调整策略
return adjustStrategyByAnalysis(baseStrategy, analysis, options);
}
/**
* 根据分析结果调整策略
*/
private ChunkingStrategy adjustStrategyByAnalysis(ChunkingStrategy baseStrategy,
DocumentAnalysis analysis,
ChunkingOptions options) {
int adjustedChunkSize = baseStrategy.chunkSize;
int adjustedOverlap = baseStrategy.overlap;
// 根据文档长度调整
if (analysis.totalLength < 1000) {
adjustedChunkSize = Math.min(adjustedChunkSize, 200);
} else if (analysis.totalLength > 10000) {
adjustedChunkSize = Math.max(adjustedChunkSize, 800);
}
// 根据复杂度调整重叠
if (analysis.complexity > 0.7) {
adjustedOverlap = (int)(adjustedOverlap * 1.5);
}
// 根据目标用途调整
if (options.targetUse == TargetUse.QA_SYSTEM) {
adjustedChunkSize = Math.min(adjustedChunkSize, 400);
adjustedOverlap = Math.max(adjustedOverlap, 40);
} else if (options.targetUse == TargetUse.SUMMARIZATION) {
adjustedChunkSize = Math.max(adjustedChunkSize, 600);
}
return new ChunkingStrategy(baseStrategy.method, adjustedChunkSize,
adjustedOverlap, baseStrategy.description);
}
/**
* 执行分块操作
*/
private List<TextSegment> executeChunking(Document document, ChunkingStrategy strategy,
DocumentAnalysis analysis) {
return switch (strategy.method) {
case SLIDING_WINDOW -> slidingWindowChunking(document, strategy);
case SEMANTIC -> semanticChunking(document, strategy);
case HIERARCHICAL -> hierarchicalChunking(document, strategy, analysis);
case FUNCTION_BASED -> functionBasedChunking(document, strategy);
case QA_PAIR -> qaPairChunking(document, strategy);
case SENTENCE_BOUNDARY -> sentenceBoundaryChunking(document, strategy);
};
}
/**
* 滑动窗口分块
*/
private List<TextSegment> slidingWindowChunking(Document document, ChunkingStrategy strategy) {
String text = document.text();
List<TextSegment> chunks = new ArrayList<>();
int chunkSize = strategy.chunkSize;
int overlap = strategy.overlap;
int step = Math.max(1, chunkSize - overlap); // a non-positive step would never advance the window
for (int i = 0; i < text.length(); i += step) {
int end = Math.min(i + chunkSize, text.length());
String chunkText = text.substring(i, end);
// 在词边界处切分
if (end < text.length()) {
int lastSpace = chunkText.lastIndexOf(' ');
if (lastSpace > chunkSize / 2) {
chunkText = chunkText.substring(0, lastSpace);
end = i + lastSpace;
}
}
Map<String, String> metadataMap = new HashMap<>();
metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
metadataMap.put("startPos", String.valueOf(i));
metadataMap.put("endPos", String.valueOf(end));
metadataMap.put("method", "sliding_window");
Metadata metadata = new Metadata(metadataMap);
chunks.add(TextSegment.from(chunkText.trim(), metadata));
if (end >= text.length()) break;
}
return chunks;
}
    /**
     * Semantic chunking
     */
    private List<TextSegment> semanticChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        // Split into sentences first
        List<String> sentences = splitIntoSentences(text);
        List<TextSegment> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();
        List<String> currentSentences = new ArrayList<>();
        for (String sentence : sentences) {
            // Check whether this sentence should open a new chunk
            if (shouldStartNewSemanticChunk(currentSentences, sentence, strategy)) {
                if (currentChunk.length() > 0) {
                    Map<String, String> metadataMap = new HashMap<>();
                    metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                    metadataMap.put("sentenceCount", String.valueOf(currentSentences.size()));
                    metadataMap.put("method", "semantic");
                    Metadata metadata = new Metadata(metadataMap);
                    chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
                    // Seed the next chunk with trailing sentences as overlap: at most two,
                    // and never the whole chunk, so that no chunk is repeated in full
                    List<String> carried = new ArrayList<>();
                    if (strategy.overlap > 0) {
                        int overlapSentences = Math.min(2, currentSentences.size() - 1);
                        // Copy the tail before resetting the list; reading it after the reset
                        // would fail because the list is then empty
                        carried.addAll(currentSentences.subList(
                                currentSentences.size() - overlapSentences, currentSentences.size()));
                    }
                    currentChunk = new StringBuilder();
                    currentSentences = new ArrayList<>();
                    for (String carriedSentence : carried) {
                        currentChunk.append(carriedSentence).append(" ");
                        currentSentences.add(carriedSentence);
                    }
                }
            }
            currentChunk.append(sentence).append(" ");
            currentSentences.add(sentence);
        }
        // Add the final chunk
        if (currentChunk.length() > 0) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("sentenceCount", String.valueOf(currentSentences.size()));
            metadataMap.put("method", "semantic");
            Metadata metadata = new Metadata(metadataMap);
            chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
        }
        return chunks;
    }
    /**
     * Hierarchical chunking
     */
    private List<TextSegment> hierarchicalChunking(Document document, ChunkingStrategy strategy,
                                                   DocumentAnalysis analysis) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();
        // Detect headings and the content that belongs to each of them
        List<Section> sections = detectSections(text);
        for (Section section : sections) {
            // Sections that hold only whitespace cannot become segments
            if (section.content.isBlank()) {
                continue;
            }
            if (section.content.length() <= strategy.chunkSize) {
                // The whole section fits into a single chunk
                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                metadataMap.put("sectionTitle", section.title);
                metadataMap.put("sectionLevel", String.valueOf(section.level));
                metadataMap.put("method", "hierarchical");
                Metadata metadata = new Metadata(metadataMap);
                chunks.add(TextSegment.from(section.content.trim(), metadata));
            } else {
                // The section is too long, so split it further with a sliding window
                List<TextSegment> sectionChunks = slidingWindowChunking(
                        Document.from(section.content), strategy);
                for (int i = 0; i < sectionChunks.size(); i++) {
                    TextSegment chunk = sectionChunks.get(i);
                    // Renumber with the document-wide index; the sliding window counted per section
                    chunk.metadata().put("chunkIndex", String.valueOf(chunks.size()));
                    chunk.metadata().put("sectionTitle", section.title);
                    chunk.metadata().put("sectionLevel", String.valueOf(section.level));
                    chunk.metadata().put("subChunkIndex", String.valueOf(i));
                    chunk.metadata().put("method", "hierarchical");
                    chunks.add(chunk);
                }
            }
        }
        return chunks;
    }
    /**
     * Function-based chunking (for code documents)
     */
    private List<TextSegment> functionBasedChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();
        // Detect code structures such as classes and methods with regular expressions
        List<CodeBlock> codeBlocks = extractCodeBlocks(text);
        for (CodeBlock block : codeBlocks) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("blockType", block.type);
            metadataMap.put("blockName", block.name);
            metadataMap.put("method", "function_based");
            Metadata metadata = new Metadata(metadataMap);
            chunks.add(TextSegment.from(block.content.trim(), metadata));
        }
        return chunks;
    }
    /**
     * Question-answer pair chunking
     */
    private List<TextSegment> qaPairChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<TextSegment> chunks = new ArrayList<>();
        // Detect question-answer pairs
        List<QAPair> qaPairs = extractQAPairs(text);
        for (QAPair pair : qaPairs) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("question", pair.question);
            metadataMap.put("method", "qa_pair");
            Metadata metadata = new Metadata(metadataMap);
            // Each chunk holds one question together with its answer
            String chunkContent = pair.question + "\n" + pair.answer;
            chunks.add(TextSegment.from(chunkContent.trim(), metadata));
        }
        return chunks;
    }
    /**
     * Sentence-boundary chunking
     */
    private List<TextSegment> sentenceBoundaryChunking(Document document, ChunkingStrategy strategy) {
        String text = document.text();
        List<String> sentences = splitIntoSentences(text);
        List<TextSegment> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();
        int sentenceCount = 0;
        for (String sentence : sentences) {
            // Close the current chunk when adding this sentence would exceed the chunk size
            if (currentChunk.length() + sentence.length() > strategy.chunkSize &&
                    currentChunk.length() > 0) {
                Map<String, String> metadataMap = new HashMap<>();
                metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
                metadataMap.put("sentenceCount", String.valueOf(sentenceCount));
                metadataMap.put("method", "sentence_boundary");
                Metadata metadata = new Metadata(metadataMap);
                chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
                currentChunk = new StringBuilder();
                sentenceCount = 0;
            }
            currentChunk.append(sentence).append(" ");
            sentenceCount++;
        }
        // Add the final chunk
        if (currentChunk.length() > 0) {
            Map<String, String> metadataMap = new HashMap<>();
            metadataMap.put("chunkIndex", String.valueOf(chunks.size()));
            metadataMap.put("sentenceCount", String.valueOf(sentenceCount));
            metadataMap.put("method", "sentence_boundary");
            Metadata metadata = new Metadata(metadataMap);
            chunks.add(TextSegment.from(currentChunk.toString().trim(), metadata));
        }
        return chunks;
    }
    /**
     * Post-processing of chunks
     */
    private List<TextSegment> postProcessChunks(List<TextSegment> chunks, ChunkingOptions options) {
        List<TextSegment> processedChunks = new ArrayList<>(chunks);
        // Merge undersized neighbours first; filtering first would discard them before they can be merged
        if (options.mergeTinyChunks) {
            processedChunks = mergeTinyChunks(processedChunks, options.minChunkSize);
        }
        // Drop chunks that are still too short or too long
        processedChunks = processedChunks.stream()
                .filter(chunk -> chunk.text().length() >= options.minChunkSize)
                .filter(chunk -> chunk.text().length() <= options.maxChunkSize)
                .collect(Collectors.toList());
        // Trim chunk ends to sentence boundaries
        if (options.optimizeBoundaries) {
            processedChunks = optimizeChunkBoundaries(processedChunks);
        }
        return processedChunks;
    }
    // Helper methods
    // The detectors below are keyword heuristics; the Chinese literals match markers in Chinese documents
    private boolean containsCodePatterns(String text) {
        return text.contains("public class") || text.contains("def ") ||
                text.contains("function") || text.contains("import ");
    }
    private boolean containsQAPatterns(String text) {
        return text.contains("问:") || text.contains("答:") ||
                text.contains("Q:") || text.contains("A:");
    }
    private boolean containsTutorialPatterns(String text) {
        return text.contains("第一章") || text.contains("步骤") ||
                text.contains("教程") || text.contains("学习");
    }
    private boolean containsTechnicalPatterns(String text) {
        return text.contains("API") || text.contains("配置") ||
                text.contains("参数") || text.contains("接口");
    }
    private String detectLanguage(String text) {
        int chineseCount = 0;
        int totalChars = 0;
        for (char c : text.toCharArray()) {
            if (Character.isLetter(c)) {
                totalChars++;
                // CJK Unified Ideographs block
                if (c >= 0x4e00 && c <= 0x9fff) {
                    chineseCount++;
                }
            }
        }
        // Treat the text as Chinese when more than 30% of its letters are CJK ideographs
        return totalChars > 0 && (double) chineseCount / totalChars > 0.3 ? "zh" : "en";
    }
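    private StructureFeatures analyzeStructure(String text) {
        // Count matches directly; split(...).length - 1 miscounts when a marker ends the text
        int headingCount = countRegexMatches("(?m)^#{1,6}\\s", text);
        int listCount = countRegexMatches("(?m)^[*+-]\\s", text);
        // Every fenced code block contributes an opening and a closing ```
        int codeBlockCount = countRegexMatches("```", text) / 2;
        return new StructureFeatures(headingCount, listCount, codeBlockCount);
    }
    private int countRegexMatches(String regex, String text) {
        return (int) Pattern.compile(regex).matcher(text).results().count();
    }
    private double calculateComplexity(String text) {
        // An empty text would otherwise yield NaN from the 0/0 ratio below
        if (text.isEmpty()) {
            return 0.0;
        }
        // Rough heuristic based on average sentence length and the share of special characters
        String[] sentences = text.split("[.!?。!?]+");
        double avgSentenceLength = sentences.length > 0 ?
                (double) text.length() / sentences.length : 0;
        int specialChars = text.replaceAll("[a-zA-Z0-9\\s\\u4e00-\\u9fff]", "").length();
        double specialCharRatio = (double) specialChars / text.length();
        return Math.min(1.0, (avgSentenceLength / 100.0) * 0.7 + specialCharRatio * 0.3);
    }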
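    private List<String> splitIntoSentences(String text) {
        // Simple splitter: cuts after a run of sentence terminators (including Chinese ones) and keeps
        // them with the sentence. It still splits decimals such as 3.14; an NLP library would do better.
        return Arrays.stream(text.split("(?<=[.!?。!?])(?![.!?。!?])"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }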
    private boolean shouldStartNewSemanticChunk(List<String> currentSentences,
                                                String newSentence,
                                                ChunkingStrategy strategy) {
        if (currentSentences.isEmpty()) return false;
        String currentText = String.join(" ", currentSentences);
        // Size limit: start a new chunk when the sentence no longer fits
        if (currentText.length() + newSentence.length() > strategy.chunkSize) {
            return true;
        }
        // Similarity check (simplified): only applied once the chunk holds at least three sentences
        if (currentSentences.size() >= 3) {
            double similarity = calculateSemanticSimilarity(currentText, newSentence);
            return similarity < SEMANTIC_SIMILARITY_THRESHOLD;
        }
        return false;
    }
    private double calculateSemanticSimilarity(String text1, String text2) {
        // Simplified stand-in: Jaccard overlap of whitespace-separated words. A real implementation
        // should compare embedding vectors; whitespace splitting also does not segment Chinese text.
        Set<String> words1 = new HashSet<>(Arrays.asList(text1.toLowerCase().split("\\s+")));
        Set<String> words2 = new HashSet<>(Arrays.asList(text2.toLowerCase().split("\\s+")));
        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);
        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
    private List<Section> detectSections(String text) {
        List<Section> sections = new ArrayList<>();
        String[] lines = text.split("\n");
        StringBuilder currentContent = new StringBuilder();
        // Content that appears before the first heading is filed under this default title
        String currentTitle = "Introduction";
        int currentLevel = 1;
        for (String line : lines) {
            if (line.matches("^#{1,6}\\s+.*")) {
                // Save the previous section
                if (currentContent.length() > 0) {
                    sections.add(new Section(currentTitle, currentLevel, currentContent.toString()));
                }
                // Start a new section: the heading level is the number of leading '#' characters
                int level = 0;
                while (level < line.length() && line.charAt(level) == '#') {
                    level++;
                }
                currentLevel = level;
                currentTitle = line.substring(level).trim();
                currentContent = new StringBuilder();
            } else {
                currentContent.append(line).append("\n");
            }
        }
        // Add the last section
        if (currentContent.length() > 0) {
            sections.add(new Section(currentTitle, currentLevel, currentContent.toString()));
        }
        return sections;
    }
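    private List<CodeBlock> extractCodeBlocks(String text) {
        List<CodeBlock> blocks = new ArrayList<>();
        // Match only the declaration header up to its opening brace. The body is found by brace
        // counting, because a pattern such as [^}]* would stop at the first nested '}'.
        Pattern classPattern = Pattern.compile("public\\s+(?:\\w+\\s+)*class\\s+(\\w+)[^{;]*\\{");
        // Modifiers such as static or final may appear between "public" and the return type
        Pattern methodPattern = Pattern.compile(
                "public\\s+(?:\\w+\\s+)*[\\w<>\\[\\]]+\\s+(\\w+)\\s*\\([^)]*\\)[^{;]*\\{");
        addMatchedBlocks(blocks, "class", classPattern, text);
        addMatchedBlocks(blocks, "method", methodPattern, text);
        return blocks;
    }
    private void addMatchedBlocks(List<CodeBlock> blocks, String type, Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            int end = findClosingBrace(text, matcher.end() - 1);
            if (end >= 0) {
                blocks.add(new CodeBlock(type, matcher.group(1), text.substring(matcher.start(), end + 1)));
            }
        }
    }
    // Returns the index of the '}' that balances the '{' at openIndex, or -1 if there is none.
    // Simplified: braces inside string literals or comments are counted as well.
    private int findClosingBrace(String text, int openIndex) {
        int depth = 0;
        for (int i = openIndex; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i;
        }
        return -1;
    }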
    private List<QAPair> extractQAPairs(String text) {
        List<QAPair> pairs = new ArrayList<>();
        String[] lines = text.split("\n");
        String currentQuestion = null;
        StringBuilder currentAnswer = new StringBuilder();
        for (String rawLine : lines) {
            // Ignore indentation so that indented question and answer markers are still recognised
            String line = rawLine.strip();
            if (line.startsWith("问:") || line.startsWith("Q:")) {
                // A new question closes the previous question-answer pair
                if (currentQuestion != null && currentAnswer.length() > 0) {
                    pairs.add(new QAPair(currentQuestion, currentAnswer.toString()));
                }
                currentQuestion = line;
                currentAnswer = new StringBuilder();
            } else if (line.startsWith("答:") || line.startsWith("A:")) {
                currentAnswer.append(line).append("\n");
            } else if (currentAnswer.length() > 0) {
                // Continuation line of a multi-line answer
                currentAnswer.append(line).append("\n");
            }
        }
        // Flush the last pair
        if (currentQuestion != null && currentAnswer.length() > 0) {
            pairs.add(new QAPair(currentQuestion, currentAnswer.toString()));
        }
        return pairs;
    }
    private List<TextSegment> mergeTinyChunks(List<TextSegment> chunks, int minSize) {
        List<TextSegment> merged = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            TextSegment current = chunks.get(i);
            if (current.text().length() < minSize && i < chunks.size() - 1) {
                TextSegment next = chunks.get(i + 1);
                if (current.text().length() + next.text().length() <= minSize * 2) {
                    // Merge the two chunks; the result keeps the metadata of the first one
                    String mergedContent = current.text() + "\n" + next.text();
                    Map<String, Object> mergedMetadata = new HashMap<>(current.metadata().toMap());
                    mergedMetadata.put("merged", "true");
                    Metadata metadata = new Metadata(mergedMetadata);
                    TextSegment mergedChunk = TextSegment.from(mergedContent, metadata);
                    merged.add(mergedChunk);
                    i++; // Skip the next chunk, which has just been merged
                    continue;
                }
            }
            merged.add(current);
        }
        return merged;
    }
    private List<TextSegment> optimizeChunkBoundaries(List<TextSegment> chunks) {
        // Simple boundary optimization: if a sentence ends within the last 20% of a chunk,
        // cut the chunk there so that it ends on a sentence boundary
        return chunks.stream()
                .map(chunk -> {
                    String text = chunk.text();
                    int lastSentenceEnd = -1;
                    // Check both ASCII and Chinese sentence terminators
                    for (char terminator : new char[] {'.', '!', '?', '。', '!', '?'}) {
                        lastSentenceEnd = Math.max(lastSentenceEnd, text.lastIndexOf(terminator));
                    }
                    if (lastSentenceEnd > text.length() * 0.8) {
                        text = text.substring(0, lastSentenceEnd + 1);
                        return TextSegment.from(text, chunk.metadata());
                    }
                    return chunk;
                })
                .collect(Collectors.toList());
    }
    /**
     * Demonstrates intelligent chunking
     */
    public void demonstrateIntelligentChunking() {
        System.out.println("=== 编程导航: Intelligent Document Chunking Demo ===");
        // Sample documents of different types. The tutorial and FAQ samples stay in Chinese
        // because the detectors above key on Chinese markers such as 第一章 and 问:
        String codeDoc = """
                public class BinarySearch {
                    public static int search(int[] arr, int target) {
                        int left = 0, right = arr.length - 1;
                        while (left <= right) {
                            int mid = left + (right - left) / 2;
                            if (arr[mid] == target) return mid;
                            if (arr[mid] < target) left = mid + 1;
                            else right = mid - 1;
                        }
                        return -1;
                    }
                }
                """;
        String tutorialDoc = """
                # Java 学习教程
                ## 第一章:基础语法
                Java 是一种强类型的面向对象编程语言。它具有跨平台、安全性高、语法简洁等特点。
                ### 1.1 变量声明
                在 Java 中,变量必须先声明后使用。声明变量需要指定数据类型。
                ### 1.2 控制结构
                Java 提供了多种控制结构,包括 if-else、for、while 等。
                ## 第二章:面向对象
                面向对象是 Java 的核心特性,包括封装、继承、多态三大特性。
                """;
        String faqDoc = """
                问:Java 和 Python 有什么区别?
                答:Java 是编译型语言,需要先编译成字节码;Python 是解释型语言,可以直接运行。Java 语法更严格,Python 语法更简洁。
                问:如何选择合适的数据结构?
                答:根据操作需求选择:频繁查找用哈希表,有序数据用数组,动态插入删除用链表。
                """;
        // Run the chunker on each document type
        testChunkingStrategy("Code document", codeDoc, DocumentType.CODE);
        testChunkingStrategy("Tutorial document", tutorialDoc, DocumentType.TUTORIAL);
        testChunkingStrategy("FAQ document", faqDoc, DocumentType.FAQ);
    }
    private void testChunkingStrategy(String docName, String content, DocumentType expectedType) {
        System.out.println("\n--- Chunking test: " + docName + " ---");
        Document document = Document.from(content);
        ChunkingOptions options = new ChunkingOptions.Builder()
                .targetUse(TargetUse.QA_SYSTEM)
                .minChunkSize(50)
                .maxChunkSize(800)
                .mergeTinyChunks(true)
                .optimizeBoundaries(true)
                .build();
        List<TextSegment> chunks = chunkDocument(document, options);
        System.out.println("Expected document type: " + expectedType);
        System.out.println("Chunk count: " + chunks.size());
        for (int i = 0; i < chunks.size(); i++) {
            TextSegment chunk = chunks.get(i);
            System.out.printf("Chunk %d: length=%d, method=%s%n",
                    i + 1, chunk.text().length(), chunk.metadata().getString("method"));
            System.out.println("Preview: " +
                    chunk.text().substring(0, Math.min(100, chunk.text().length())) + "...");
        }
    }
    // Nested classes and enum definitions
    public enum DocumentType {
        CODE, TECHNICAL_DOC, TUTORIAL, FAQ, GENERAL
    }
    public enum ChunkingMethod {
        SLIDING_WINDOW, SEMANTIC, HIERARCHICAL, FUNCTION_BASED, QA_PAIR, SENTENCE_BOUNDARY
    }
    public enum TargetUse {
        QA_SYSTEM, SUMMARIZATION, SEARCH, ANALYSIS
    }
    public static class ChunkingStrategy {
        public final ChunkingMethod method;
        public final int chunkSize;
        public final int overlap;
        public final String description;
        public ChunkingStrategy(ChunkingMethod method, int chunkSize, int overlap, String description) {
            this.method = method;
            this.chunkSize = chunkSize;
            this.overlap = overlap;
            this.description = description;
        }
    }
    public static class ChunkingOptions {
        public final ChunkingStrategy preferredStrategy;
        public final TargetUse targetUse;
        public final int minChunkSize;
        public final int maxChunkSize;
        public final boolean mergeTinyChunks;
        public final boolean optimizeBoundaries;
        private ChunkingOptions(Builder builder) {
            this.preferredStrategy = builder.preferredStrategy;
            this.targetUse = builder.targetUse;
            this.minChunkSize = builder.minChunkSize;
            this.maxChunkSize = builder.maxChunkSize;
            this.mergeTinyChunks = builder.mergeTinyChunks;
            this.optimizeBoundaries = builder.optimizeBoundaries;
        }
        public static class Builder {
            private ChunkingStrategy preferredStrategy;
            private TargetUse targetUse = TargetUse.SEARCH;
            private int minChunkSize = 50;
            private int maxChunkSize = 1000;
            private boolean mergeTinyChunks = false;
            private boolean optimizeBoundaries = false;
            public Builder preferredStrategy(ChunkingStrategy strategy) {
                this.preferredStrategy = strategy;
                return this;
            }
            public Builder targetUse(TargetUse use) {
                this.targetUse = use;
                return this;
            }
            public Builder minChunkSize(int size) {
                this.minChunkSize = size;
                return this;
            }
            public Builder maxChunkSize(int size) {
                this.maxChunkSize = size;
                return this;
            }
            public Builder mergeTinyChunks(boolean merge) {
                this.mergeTinyChunks = merge;
                return this;
            }
            public Builder optimizeBoundaries(boolean optimize) {
                this.optimizeBoundaries = optimize;
                return this;
            }
            public ChunkingOptions build() {
                return new ChunkingOptions(this);
            }
        }
    }
    private static class DocumentAnalysis {
        public final DocumentType documentType;
        public final String language;
        public final int totalLength;
        public final int lineCount;
        public final int paragraphCount;
        public final StructureFeatures structure;
        public final double complexity;
        public DocumentAnalysis(DocumentType documentType, String language, int totalLength,
                                int lineCount, int paragraphCount, StructureFeatures structure,
                                double complexity) {
            this.documentType = documentType;
            this.language = language;
            this.totalLength = totalLength;
            this.lineCount = lineCount;
            this.paragraphCount = paragraphCount;
            this.structure = structure;
            this.complexity = complexity;
        }
    }
    private static class StructureFeatures {
        public final int headingCount;
        public final int listCount;
        public final int codeBlockCount;
        public StructureFeatures(int headingCount, int listCount, int codeBlockCount) {
            this.headingCount = headingCount;
            this.listCount = listCount;
            this.codeBlockCount = codeBlockCount;
        }
    }
    private static class ChunkingRule {
        // Custom chunking rules (extension point)
    }
    private static class Section {
        public final String title;
        public final int level;
        public final String content;
        public Section(String title, int level, String content) {
            this.title = title;
            this.level = level;
            this.content = content;
        }
    }
    private static class CodeBlock {
        public final String type;
        public final String name;
        public final String content;
        public CodeBlock(String type, String name, String content) {
            this.type = type;
            this.name = name;
            this.content = content;
        }
    }
    private static class QAPair {
        public final String question;
        public final String answer;
        public QAPair(String question, String answer) {
            this.question = question;
            this.answer = answer;
        }
    }
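}
Through these three exercises, we explored the core concepts and practical applications of vector databases and embedding techniques. The first exercise implemented and compared several similarity measures. The second built a performance monitoring system that tracks the runtime state of a vector database in real time. The third implemented an intelligent document chunking system that selects a chunking strategy based on the characteristics of each document.
These exercises clarify the technical principles behind vector databases and, more importantly, show how to apply them to concrete problems in real projects. Whether you are building an intelligent Q&A system, implementing semantic search, or optimizing a document processing pipeline, these skills will support your AI application development.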
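The comment in `calculateSemanticSimilarity` notes that a real implementation should compare embedding vectors rather than word overlap. As a minimal sketch of that idea, assume each sentence has already been embedded (for example with the `BgeSmallZhEmbeddingModel` introduced at the start of this chapter). The code below compares neighbouring sentence vectors by cosine similarity and starts a new chunk when the similarity drops below a threshold. The class name `EmbeddingBoundarySketch`, the method `chunkStarts`, the toy three-dimensional vectors, and the threshold of 0.5 are illustrative assumptions, not part of LangChain4j or of the exercise code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: finds chunk boundaries from the cosine similarity of sentence vectors. */
final class EmbeddingBoundarySketch {

    /** Cosine similarity of two equal-length vectors; returns 0.0 if either vector has zero norm. */
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns the sentence indices at which a new chunk starts: index 0, plus every sentence
     * whose vector is less similar than the threshold to the vector of the preceding sentence.
     */
    static List<Integer> chunkStarts(List<float[]> sentenceVectors, double threshold) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < sentenceVectors.size(); i++) {
            boolean topicShift = i > 0
                    && cosineSimilarity(sentenceVectors.get(i - 1), sentenceVectors.get(i)) < threshold;
            if (i == 0 || topicShift) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Toy vectors standing in for real sentence embeddings (assumed values):
        // the first two point in nearly the same direction, the third is almost orthogonal.
        List<float[]> vectors = List.of(
                new float[] {1.0f, 0.0f, 0.0f},
                new float[] {0.9f, 0.1f, 0.0f},
                new float[] {0.0f, 1.0f, 0.0f});
        System.out.println(chunkStarts(vectors, 0.5)); // prints [0, 2]
    }
}
```

In the exercise code, a check of this kind could take the place of the word-overlap calculation inside `shouldStartNewSemanticChunk`. A suitable threshold depends on the embedding model and should be tuned on real data.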
