[導(dǎo)讀]通常,我們開發(fā)基于 LLM 的檢索應(yīng)用程序的知識庫包含大量各種格式的數(shù)據(jù)。為了向LLM提供最相關(guān)的上下文來回答知識庫中特定部分的問題,我們依賴于對知識庫中的文本進行分塊并將其放在方便的位置。

通常,我們開發(fā)基于 LLM 的檢索應(yīng)用程序的知識庫包含大量各種格式的數(shù)據(jù)。為了向LLM提供最相關(guān)的上下文來回答知識庫中特定部分的問題,我們依賴于對知識庫中的文本進行分塊并將其放在方便的位置。

Chunking

Chunking is the process of splitting text into meaningful units to improve information retrieval. By ensuring that each chunk represents a single focused idea or point, chunking helps preserve the contextual integrity of the content.
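As a minimal illustration of what "meaningful units" can look like in practice (this sketch is not part of the article's own code), a sliding-window chunker keeps chunks a manageable size while overlapping neighboring chunks, so an idea cut off at one chunk boundary still appears whole in the next chunk:

```python
def sliding_window_chunks(text, chunk_size=200, overlap=50):
    """Split text into fixed-size chunks whose edges overlap, so content
    cut off at one chunk boundary still appears whole in the next chunk."""
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "Artificial intelligence is transforming industries across the globe. " * 10
chunks = sliding_window_chunks(sample, chunk_size=200, overlap=50)
# The last 50 characters of each chunk reappear at the start of the next one.
print(len(chunks), chunks[0][150:] == chunks[1][:50])
```

The overlap is the knob that trades storage for safety: a larger overlap duplicates more text across chunks but makes it less likely that a retrieval-relevant sentence is ever split in half.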

In this article, we will look at three aspects of chunking:

· How poor chunking leads to less relevant results

· How good chunking leads to better results

· How good chunking combined with metadata leads to well-contextualized results

To demonstrate the importance of chunking effectively, we will take the same piece of text, apply three different chunking methods to it, and examine how information is retrieved for a query.

Chunking and Storing in Qdrant

Let's look at the code below, which shows three different ways of chunking the same text.

Python

import qdrant_client
from qdrant_client.models import PointStruct, Distance, VectorParams
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    print(f"Generating embedding for: '{text[:50]}'...")  # Show a snippet of the text being embedded
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    embedding = response.data[0].embedding  # Access using the attribute, not as a dictionary
    print(f"Generated embedding of length {len(embedding)}.")  # Confirm embedding generation
    return embedding

# Create a collection if it doesn't exist
def create_collection_if_not_exists(collection_name, vector_size):
    collections = client.get_collections().collections
    if collection_name not in [collection.name for collection in collections]:
        client.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=vector_size, distance=Distance.COSINE)
        )
        print(f"Created collection: {collection_name} with vector size: {vector_size}")
    else:
        print(f"Collection {collection_name} already exists.")

# Sample text to be chunked, used purely for illustration.
text = """
Artificial intelligence is transforming industries across the globe. One of the key areas where AI is making a significant impact is healthcare. AI is being used to develop new drugs, personalize treatment plans, and even predict patient outcomes. Despite these advancements, there are challenges that must be addressed. The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues. As AI continues to evolve, it is crucial that these challenges are not overlooked. By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
"""

# Poor chunking strategy: arbitrary fixed-size character slices
def poor_chunking(text, chunk_size=40):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    print(f"Poor Chunking produced {len(chunks)} chunks: {chunks}")
    return chunks

# Good chunking strategy: split on sentence boundaries
def good_chunking(text):
    import re
    sentences = re.split(r'(?<=[.!?]) +', text)
    print(f"Good Chunking produced {len(sentences)} chunks: {sentences}")
    return sentences

# Good chunking with metadata attached to each sentence-level chunk
def good_chunking_with_metadata(text):
    chunks = good_chunking(text)
    metadata_chunks = []
    for chunk in chunks:
        if "healthcare" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Healthcare Section", "topic": "AI in Healthcare"})
        elif "ethical implications" in chunk or "data privacy" in chunk:
            metadata_chunks.append({"text": chunk, "source": "Challenges Section", "topic": "AI Challenges"})
        else:
            metadata_chunks.append({"text": chunk, "source": "General", "topic": "AI Overview"})
    print(f"Good Chunking with Metadata produced {len(metadata_chunks)} chunks: {metadata_chunks}")
    return metadata_chunks

# Store chunks in Qdrant
def store_chunks(chunks, collection_name):
    if len(chunks) == 0:
        print(f"No chunks were generated for the collection '{collection_name}'.")
        return
    # Generate embedding for the first chunk to determine vector size
    sample_text = chunks[0] if isinstance(chunks[0], str) else chunks[0]["text"]
    sample_embedding = embed_text(sample_text)
    vector_size = len(sample_embedding)
    create_collection_if_not_exists(collection_name, vector_size)
    for idx, chunk in enumerate(chunks):
        text = chunk if isinstance(chunk, str) else chunk["text"]
        embedding = embed_text(text)
        payload = chunk if isinstance(chunk, dict) else {"text": text}  # Always ensure there's text in the payload
        client.upsert(collection_name=collection_name, points=[
            PointStruct(id=idx, vector=embedding, payload=payload)
        ])
    print(f"Chunks successfully stored in the collection '{collection_name}'.")

# Execute chunking and storing separately for each strategy
print("Starting poor_chunking...")
store_chunks(poor_chunking(text), "poor_chunking")

print("Starting good_chunking...")
store_chunks(good_chunking(text), "good_chunking")

print("Starting good_chunking_with_metadata...")
store_chunks(good_chunking_with_metadata(text), "good_chunking_with_metadata")

The code above does the following:

· The embed_text function takes a piece of text, generates an embedding using the OpenAI embeddings model, and returns the generated embedding.

· Initializes the text string used for chunking and later retrieval.

· Poor chunking strategy: splits the text into chunks of 40 characters each.

· Good chunking strategy: splits the text on sentence boundaries for more meaningful context.

· Good chunking strategy with metadata: attaches appropriate metadata to the sentence-level chunks.

· Once embeddings have been generated for the chunks, they are stored in the corresponding collections in Qdrant Cloud.

Keep in mind that the poor chunking strategy exists only to demonstrate how bad chunking affects retrieval.

Below is a screenshot of the chunks in Qdrant Cloud, where you can see the metadata added to the sentence-level chunks to indicate their source and topic.

Retrieval Results by Chunking Strategy

現(xiàn)在讓我們編寫一些代碼來根據(jù)查詢從 Qdrant Vector DB 中檢索內(nèi)容。

Python

import qdrant_client
import openai
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Initialize Qdrant client
client = qdrant_client.QdrantClient(config['qdrant']['url'], api_key=config['qdrant']['api_key'])

# Initialize OpenAI with the API key
openai.api_key = config['openai']['api_key']

def embed_text(text):
    response = openai.embeddings.create(
        input=[text],  # Input needs to be a list
        model=config['openai']['model_name']
    )
    return response.data[0].embedding

# Search a collection and print the top matches for the query
def retrieve_and_print(collection_name, query_embedding, query):
    print(f"\nResults from '{collection_name}' collection for the query: '{query}':")
    results = client.search(
        collection_name=collection_name,
        query_vector=query_embedding,
        limit=3  # Retrieve the top 3 vectors closest to the query embedding
    )
    for i, result in enumerate(results, start=1):
        print(f"\nResult {i}:")
        print(f"Text: {result.payload.get('text', 'N/A')}")
        print(f"Source: {result.payload.get('source', 'N/A')}")
        print(f"Topic: {result.payload.get('topic', 'N/A')}")

# Define the query and generate its embedding
query = "ethical implications of AI in healthcare"
query_embedding = embed_text(query)

# Run the same query against each collection
for collection_name in ["poor_chunking", "good_chunking", "good_chunking_with_metadata"]:
    retrieve_and_print(collection_name, query_embedding, query)

The code above does the following:

· Defines the query and generates an embedding for it

· The search query is set to "ethical implications of AI in healthcare".

· The retrieve_and_print function searches a specific Qdrant collection and retrieves the top 3 vectors closest to the query embedding.

Now let's look at the output:

python retrieval_test.py

Results from 'poor_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:
Text: . The ethical implications of AI in heal
Source: N/A
Topic: N/A

Result 2:
Text: ant impact is healthcare. AI is being us
Source: N/A
Topic: N/A

Result 3:
Text: 
Artificial intelligence is transforming
Source: N/A
Topic: N/A

Results from 'good_chunking' collection for the query: 'ethical implications of AI in healthcare':

Result 1:
Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
Source: N/A
Topic: N/A

Result 2:
Text: One of the key areas where AI is making a significant impact is healthcare.
Source: N/A
Topic: N/A

Result 3:
Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
Source: N/A
Topic: N/A

Results from 'good_chunking_with_metadata' collection for the query: 'ethical implications of AI in healthcare':

Result 1:
Text: The ethical implications of AI in healthcare, data privacy concerns, and the need for proper regulation are all critical issues.
Source: Healthcare Section
Topic: AI in Healthcare

Result 2:
Text: One of the key areas where AI is making a significant impact is healthcare.
Source: Healthcare Section
Topic: AI in Healthcare

Result 3:
Text: By addressing these issues head-on, we can ensure that AI is used in a way that benefits everyone.
Source: General
Topic: AI Overview

The output for the same search query differs depending on the chunking strategy used:

· Poor chunking strategy: notice that the results here are less relevant, because the text was split into arbitrary small chunks.

· Good chunking strategy: the results here are more relevant, because the text was split into sentences that preserve semantic meaning.

· Good chunking strategy with metadata: the results here are the most accurate, because the text was chunked thoughtfully and enriched with metadata.

Inferences from the Experiment

· Chunking needs a deliberate strategy, and chunks should be neither too small nor too large.

· Examples of bad chunking are chunks so small that sentences are cut off at unnatural places, or chunks so large that multiple topics land in the same chunk, which makes retrieval very confusing.

· The whole idea of chunking revolves around providing better context to the LLM.

· Metadata greatly enhances well-structured chunks by adding an extra layer of context. For example, we added the source and topic as metadata elements to our chunks.

· Retrieval systems benefit from this additional information. For instance, if the metadata indicates that a chunk belongs to the "Healthcare Section", the system can prioritize those chunks for healthcare-related queries.

· With improved chunking, results can be structured and categorized. If a query matches multiple contexts within the same text, we can determine which context or section a piece of information belongs to by looking at the chunk's metadata.

Keep these strategies in mind, and good luck with chunking in your LLM-based search applications.
