Spark
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including
Spark SQL for SQL and DataFrames, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.
Document loaders
PySpark
The PySparkDataFrameLoader loads data from a PySpark DataFrame.
See a usage example.
from langchain_community.document_loaders import PySparkDataFrameLoader
API Reference: PySparkDataFrameLoader
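A minimal sketch of how the loader might be used, assuming a local SparkSession; the sample rows and the "text" column name are illustrative:
from pyspark.sql import SparkSession
from langchain_community.document_loaders import PySparkDataFrameLoader

# Start (or reuse) a local Spark session and build a small DataFrame.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Spark is a unified analytics engine.",), ("It supports SQL, streaming, and ML.",)],
    ["text"],
)

# Each row becomes a Document; page_content_column selects which column
# supplies the Document text.
loader = PySparkDataFrameLoader(spark, df, page_content_column="text")
docs = loader.load()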
Tools/Toolkits
Spark SQL toolkit
Toolkit for interacting with Spark SQL.
See a usage example.
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent
from langchain_community.utilities.spark_sql import SparkSQL
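A minimal sketch of wiring the toolkit into an agent, assuming an active Spark session with an existing schema; the schema name and the example question are illustrative, and ChatOpenAI is used only as a stand-in LLM (any chat model works):
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_openai import ChatOpenAI

# Wrap a Spark SQL schema; SparkSQL reuses the active SparkSession.
spark_sql = SparkSQL(schema="langchain_example")  # schema name is illustrative
llm = ChatOpenAI(temperature=0)

# The toolkit bundles the query, schema-info, list-tables, and query-checker tools.
toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)
agent_executor = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent_executor.run("Describe the titanic table")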
Spark SQL individual tools
You can use individual tools from the Spark SQL Toolkit:
InfoSparkSQLTool: tool for getting metadata about a Spark SQL database
ListSparkSQLTool: tool for getting table names
QueryCheckerTool: tool that uses an LLM to check whether a query is correct
QuerySparkSQLTool: tool for querying a Spark SQL database
from langchain_community.tools.spark_sql.tool import InfoSparkSQLTool
from langchain_community.tools.spark_sql.tool import ListSparkSQLTool
from langchain_community.tools.spark_sql.tool import QueryCheckerTool
from langchain_community.tools.spark_sql.tool import QuerySparkSQLTool
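A minimal sketch of using the tools directly, assuming the same SparkSQL wrapper and LLM as above; each tool takes the db wrapper, and QueryCheckerTool additionally needs an llm. The schema, table, and query below are illustrative:
from langchain_community.tools.spark_sql.tool import (
    InfoSparkSQLTool,
    ListSparkSQLTool,
    QueryCheckerTool,
    QuerySparkSQLTool,
)
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_openai import ChatOpenAI

spark_sql = SparkSQL(schema="langchain_example")  # schema name is illustrative
llm = ChatOpenAI(temperature=0)

# List table names, inspect a table's schema, sanity-check a query with the LLM,
# then run the query against Spark SQL.
tables = ListSparkSQLTool(db=spark_sql).run("")
schema_info = InfoSparkSQLTool(db=spark_sql).run("titanic")  # table name is illustrative
checked = QueryCheckerTool(db=spark_sql, llm=llm).run("SELECT COUNT(*) FROM titanic")
result = QuerySparkSQLTool(db=spark_sql).run("SELECT COUNT(*) FROM titanic")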