Big Data Tools

Compatible with IntelliJ IDEA Ultimate, DataGrip and 2 more

A bundle of plugins for data engineers and other specialists working with big data. Installed in your favorite JetBrains IDE, Big Data Tools helps you develop, visualize, debug, and monitor big data pipelines built in Scala, Python, and SQL.


Use Big Data Tools for:

  • Exploratory analysis, visualization, and prototyping jobs in Zeppelin notebooks.
  • Running and monitoring Spark or Flink jobs directly from your IDE.
  • Working with Amazon EMR clusters.
  • Viewing big data files, such as CSV, Parquet, ORC, and Avro.
  • Producing and consuming messages with Kafka.
  • Previewing Hive Metastore databases.
  • Getting insights about your Hadoop environment.
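The file viewer, for instance, shows the header and the first rows of a tabular file. Outside the IDE, a comparable quick preview of a CSV takes only a few lines of standard-library Python. This is a hypothetical helper for illustration, not part of the plugin:

```python
import csv
import io

def preview_csv(text, max_rows=5):
    """Return the header and up to max_rows data rows of a CSV document,
    roughly the quick look a tabular file viewer gives you."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return [], []
    return rows[0], rows[1:1 + max_rows]

sample = "id,name,score\n1,alice,0.9\n2,bob,0.7\n3,carol,0.8\n"
header, rows = preview_csv(sample, max_rows=2)
print(header)  # ['id', 'name', 'score']
print(rows)    # [['1', 'alice', '0.9'], ['2', 'bob', '0.7']]
```

For Parquet, ORC, or Avro the plugin does the equivalent work for you, since those binary formats cannot be previewed with the standard library alone.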

Built-in tools and integrations:

  • Supported languages: Scala, Python, SQL.
  • Notebooks: Zeppelin.
  • Monitoring: Hadoop, Kafka, Spark, Hive Metastore, Flink, AWS Glue.
  • Remote file storages: AWS S3, Google Cloud Storage, Microsoft Azure, Tencent Cloud Object Storage (COS), DigitalOcean Spaces, Alibaba OSS, Hadoop Distributed File System (HDFS), and more.
  • File systems: HDFS, Local, SFTP.
  • Data processing platforms: AWS EMR.

What’s New

Changes in 2023.3

Big Data Tools

Features

  • Added Speed Search to the connection tree in Settings | Tools | Big Data Tools

Fixed Bugs

  • Fixed performance problems with password storage

Kafka

Features

  • Supported local Protobuf and Avro schema files
  • Added the Show schema action for local schema files
  • In Spring projects, Kafka connections can be created directly from the gutter
  • Added ability to set and create consumer groups for Kafka consumer
  • Extended the usability and functionality of the Consumer groups section of the Kafka plugin
  • Added highlighting for Avro messages for the Consumer
  • Added integer type for keys/values for Consumer and Producer
  • Added Speed Search for Kafka tool window
  • Added the Local port field for connections via SSH tunnel and a hint about limitation of such connections
  • Installing the Kafka plugin no longer requires an IDE restart

Fixed Bugs

  • Fixed issues with applying presets to Producer
  • Improved validation warnings and error messages
  • Better column sizing after changing the Producer options and table clearing
  • Fixed generation and tree view for JSON and Protobuf schemas with references
  • It is now possible to specify multiple schema registry URLs in the Kafka configuration
  • Key/Value editor fields are hidden when random generation is enabled
  • Improved the Properties source field for Kafka connection settings
  • Fixed exception related to the slash symbol in a Topic or Schema name

Spark

Features

  • Added ability to submit Spark applications to a Dataproc cluster
  • Added ability to use a custom Spark cluster connection for submitting applications
  • Added automatic run configuration suggestion for PySpark projects
  • New PySpark project wizard is available
  • Reworked the Spark project wizard
  • Added automatic run configuration suggestion for Spark projects using Scala 3
  • Added debugging support for JVM Spark applications run with a Run configuration
  • Added static analysis for Spark and PySpark DataFrame columns
  • Added autocompletion for Spark and PySpark DataFrame columns
  • Added Spark jobs monitoring (EMR, Dataproc, custom cluster) to the Services tool window
  • Spark monitoring Executor logs now provide two options: stdout and stderr
  • View settings for Spark monitoring Stages in the Jobs tab and in the Stages tab are now independent
  • Added Spark monitoring and SFTP templates for EMR clusters in the Big Data Tools tool window
  • Added a button to copy Spark application logs from the Console
  • Added ability to filter Spark applications `by me` (submitted by the current user)

Fixed Bugs

  • Fixed limitations on the Spark application list so that required applications are no longer missing from it
  • Spark and Hadoop monitoring view settings are saved correctly now
  • `Delete Job` button is now disabled when a Job is executing
  • Added support for the sso-session property in AWS connections
  • AWS Glue region is saved correctly now

Flink

  • Fixed some localization issues

Remote File Systems

Features

  • Added ability to filter by prefix for storages
  • Added path indication to editor viewer tab
  • Updated regions list for Linode and DigitalOcean Spaces
  • New regions supported for Alibaba OSS
  • Added possibility to set a custom Endpoint and Region for Alibaba OSS connection

Fixed Bugs

  • Fixed sorting in Editor viewer
  • Fixed constantly collapsed nodes
  • Fixed credentials caching issue for S3 connections
  • Fixed hanging UI on automatic driver refresh
  • Content-type field added to file info for Alibaba OSS
  • The Reload, Upload, and Compare buttons are available again for remote editable files
  • File size for S3 connections is displayed more accurately now
  • Shortcuts for copying a file and copying an absolute path now match the Project panel

Big Data File Viewer

  • Fixed issues with opening DDL for tabular files
  • Fixed int32 date column display in Parquet files
  • Fixed rendering of microseconds and nanoseconds in Parquet file timestamps
  • The Parquet file viewer shows commas correctly in JSON-formatted columns
  • Umlaut characters are now displayed in Parquet files
  • Fixed display of `map` columns in Parquet files
  • Fixed a bug with opening zstd-compressed Parquet files
  • Timestamp columns are now displayed independently of the time zone in which the file is opened
  • Array columns are displayed correctly now: header name, delimiter, and empty arrays
  • Fixed issues with displaying rows in Parquet files
  • Data files with Chinese names can be previewed now
  • Avro files are opened properly now
  • Fixed exception on opening files with special characters in their names
  • The tab of the deleted file is closed properly now
  • Fixed issue while opening an empty CSV file

Zeppelin

  • Indentation is now correct in the Go To The Paragraph window
  • Fixed Scala autocompletion with Kotlin plugin disabled
  • Fixed some exceptions on connection to Zeppelin and on opening a Zeppelin notebook

The full release notes are available here.

Feb 06, 2025
Version 251.20015.29

Getting Started

* These instructions link to the Big Data Tools documentation for IntelliJ IDEA. If you use the plugin with a different JetBrains IDE, please use one of the links below instead:

PyCharm | DataSpell | DataGrip

To start using Big Data Tools in IntelliJ IDEA:
  1. Install the Big Data Tools plugin.
  2. Install the required language plugins (Scala or Python).
  3. Create a new project in your IDE.
  4. Connect to a particular server, storage, or service.
  5. Voila! You’re ready to start working on your project.

Rating & Reviews

4.1
38 Ratings (2,888,656 Downloads)

Superb, makes my life so much easier in S3.


孙琦夫

07.01.2025

Unable to load very large folders when accessing AWS-compatible storage. Consider loading only the first 1,000 files initially and providing an option to manually trigger loading more files.


Vasyl Khrystiuk

08.08.2024

Nice plugin, but after upgrading to 2024.2 it is disabled because of the missing "com.intellij.bigdatatools.databricks" dependency. The dependency can be downloaded here: https://plugins.jetbrains.com/plugin/24359-databricks, or simply reinstall the plugin (all settings and connections were kept for me).


Additional Information

Vendor:
Plugin ID:
com.intellij.bigdatatools