Package Details: apache-spark 3.5.1-1

Git Clone URL: https://aur.archlinux.org/apache-spark.git (read-only)
Package Base: apache-spark
Description: A unified analytics engine for large-scale data processing
Upstream URL: http://spark.apache.org
Keywords: spark
Licenses: Apache
Submitter: huitseeker
Maintainer: aakashhemadri
Last Packager: aakashhemadri
Votes: 57
Popularity: 0.015223
First Submitted: 2015-10-04 09:31 (UTC)
Last Updated: 2024-05-07 17:40 (UTC)

Latest Comments

ryukinix commented on 2020-01-17 23:12 (UTC) (edited on 2020-01-17 23:13 (UTC) by ryukinix)

Updating to Spark 3.0.0-preview2 makes it work with Python 3.8. I'm using this modified version of the PKGBUILD: https://github.com/ryukinix/apache-spark-pkgbuild. So far it's working fine.

The v2.4.5 tag was published on GitHub 5 days ago, but the compiled binaries are not yet available in the Apache Spark archive, which is why I used 3.0.0-preview2, the most recent version available there.

Both v2.4.5 and v3.0.0 contain the commit that fixes the Python 3.8 problems: https://github.com/apache/spark/commit/811d563fbf60203377e8462e4fad271c1140b4fa
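
A minimal sketch of that version bump, assuming a pkgver/_srcver split (makepkg forbids hyphens in pkgver) and that the archive lists a -bin-without-hadoop tarball for the preview release:

pkgver=3.0.0.preview2   # makepkg does not allow '-' in pkgver
_srcver=3.0.0-preview2  # the upstream tag keeps the hyphen
source=("https://archive.apache.org/dist/spark/spark-${_srcver}/spark-${_srcver}-bin-without-hadoop.tgz")
# regenerate sha1sums after changing source, e.g. with updpkgsums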

YaLTeR commented on 2019-12-13 08:25 (UTC) (edited on 2019-12-13 08:45 (UTC) by YaLTeR)

Doesn't seem to work with Python 3 (the default)?

└─ pyspark
Picked up _JAVA_OPTIONS: -Dawt.useSystemAAFontSettings=on -Dswing.aatext=true
Python 3.8.0 (default, Oct 23 2019, 18:51:26)
[GCC 9.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Traceback (most recent call last):
  File "/opt/apache-spark/python/pyspark/shell.py", line 31, in <module>
    from pyspark import SparkConf
  File "/opt/apache-spark/python/pyspark/__init__.py", line 51, in <module>
    from pyspark.context import SparkContext
  File "/opt/apache-spark/python/pyspark/context.py", line 31, in <module>
    from pyspark import accumulators
  File "/opt/apache-spark/python/pyspark/accumulators.py", line 97, in <module>
    from pyspark.serializers import read_int, PickleSerializer
  File "/opt/apache-spark/python/pyspark/serializers.py", line 71, in <module>
    from pyspark import cloudpickle
  File "/opt/apache-spark/python/pyspark/cloudpickle.py", line 145, in <module>
    _cell_set_template_code = _make_cell_set_template_code()
  File "/opt/apache-spark/python/pyspark/cloudpickle.py", line 126, in _make_cell_set_template_code
    return types.CodeType(
TypeError: an integer is required (got type bytes)

Update: looks like it's Python 3.8; installing python37 from the AUR and using that seems to work.
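
If you go the python37 route, pyspark can be pointed at it explicitly through Spark's standard PYSPARK_PYTHON variables (a sketch; assumes the AUR package puts a python3.7 binary on PATH):

# use Python 3.7 for both the driver and the executors
export PYSPARK_PYTHON=python3.7
export PYSPARK_DRIVER_PYTHON=python3.7
pyspark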

tir commented on 2019-07-17 10:35 (UTC) (edited on 2019-07-17 10:39 (UTC) by tir)

I had success with the following PKGBUILD (patch: https://git.io/fj1CT).

# Maintainer: François Garillot ("huitseeker") <francois [at] garillot.net>
# Contributor: Christian Krause ("wookietreiber") <kizkizzbangbang@gmail.com>

pkgname=apache-spark
pkgver=2.4.3
pkgrel=1
pkgdesc="fast and general engine for large-scale data processing"
arch=('any')
url="http://spark.apache.org"
license=('APACHE')
depends=('java-environment>=6' 'java-environment<9')
optdepends=('python2: python2 support for pyspark'
            'ipython2: ipython2 support for pyspark'
            'python: python3 support for pyspark'
            'ipython: ipython3 support for pyspark'
            'r: support for sparkR'
            'rsync: support rsync hadoop binaries from master'
            'hadoop: support for running on YARN')
install=apache-spark.install
source=("https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-${pkgver}/spark-${pkgver}-bin-without-hadoop.tgz"
        'apache-spark-master.service'
        'apache-spark-slave@.service'
        'spark-env.sh'
        'spark-daemon-run.sh'
        'run-master.sh'
        'run-slave.sh')
sha1sums=('54bf6a19eb832dc0cf2d7a7465b785390d00122b'
          'ac71d12070a9a10323e8ec5aed4346b1dd7f21c6'
          'a191e4f8f7f8bbc596f4fadfb3c592c3efbc4fc0'
          '3fa39d55075d4728bd447692d648053c9f6b07ec'
          '08557d2d5328d5c99e533e16366fd893fffaad78'
          '323445b8d64aea0534a2213d2600d438f406855b'
          '65b1bc5fce63d1fa7a1b90f2d54a09acf62012a4')
backup=('etc/apache-spark/spark-env.sh')

PKGEXT=${PKGEXT:-'.pkg.tar.xz'}

prepare() {
  cd "$srcdir/spark-${pkgver}-bin-without-hadoop"
}

package() {
        cd "$srcdir/spark-${pkgver}-bin-without-hadoop"

        install -d "$pkgdir/usr/bin" "$pkgdir/opt" "$pkgdir/var/log/apache-spark" "$pkgdir/var/lib/apache-spark/work"
        chmod 2775 "$pkgdir/var/log/apache-spark" "$pkgdir/var/lib/apache-spark/work"

        cp -r "$srcdir/spark-${pkgver}-bin-without-hadoop" "$pkgdir/opt/apache-spark/"

        cd "$pkgdir/usr/bin"
        for binary in beeline pyspark sparkR spark-class spark-shell find-spark-home spark-sql spark-submit load-spark-env.sh; do
                binpath="/opt/apache-spark/bin/$binary"
                ln -s "$binpath" "$binary"
                sed -i 's|^export SPARK_HOME=.*$|export SPARK_HOME=/opt/apache-spark|' "$pkgdir/$binpath"
                sed -i -Ee 's/\$\(dirname "\$0"\)/$(dirname "$(readlink -f "$0")")/g' "$pkgdir/$binpath"
        done

        mkdir -p "$pkgdir/etc/profile.d"
        echo '#!/bin/sh' > "$pkgdir/etc/profile.d/apache-spark.sh"
        echo 'SPARK_HOME=/opt/apache-spark' >> "$pkgdir/etc/profile.d/apache-spark.sh"
        echo 'export SPARK_HOME' >> "$pkgdir/etc/profile.d/apache-spark.sh"
        chmod 755 "$pkgdir/etc/profile.d/apache-spark.sh"

        install -Dm644 "$srcdir/apache-spark-master.service" "$pkgdir/usr/lib/systemd/system/apache-spark-master.service"
        install -Dm644 "$srcdir/apache-spark-slave@.service" "$pkgdir/usr/lib/systemd/system/apache-spark-slave@.service"
        install -Dm644 "$srcdir/spark-env.sh" "$pkgdir/etc/apache-spark/spark-env.sh"
        for script in run-master.sh run-slave.sh spark-daemon-run.sh; do
            install -Dm755 "$srcdir/$script" "$pkgdir/opt/apache-spark/sbin/$script"
        done
        install -Dm644 "$srcdir/spark-${pkgver}-bin-without-hadoop/conf"/* "$pkgdir/etc/apache-spark"

        cd "$pkgdir/opt/apache-spark"
        mv conf conf-templates
        ln -sf "/etc/apache-spark" conf
        ln -sf "/var/lib/apache-spark/work" .
}
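
To try a PKGBUILD like this one, the usual flow is to drop it (and the patched support files) into a checkout of the AUR repo and rebuild; this is the standard makepkg workflow, sketched:

git clone https://aur.archlinux.org/apache-spark.git
cd apache-spark
# replace PKGBUILD and the support files with the versions above, then:
makepkg -si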

There is another, unrelated issue, probably with JLine (as noted here: https://github.com/sanori/spark-sbt/issues/4#issuecomment-401621777), for which the workaround is:

$ TERM=xterm-color spark-shell
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 10.0.2)
Type in expressions to have them evaluated.
Type :help for more information.

lukaszimmermann commented on 2019-07-15 09:05 (UTC)

Hi, sorry for being unresponsive the last couple of months. Emanuel, I added you as co-maintainer; can you perhaps update this package with the new PKGBUILD for Spark 2.4.3?

emanuelfontelles commented on 2019-07-13 17:56 (UTC) (edited on 2019-07-13 17:57 (UTC) by emanuelfontelles)

Hey guys, here is a full PKGBUILD for Spark 2.4.3 with the Hadoop binaries; it totally works.

# Maintainer: François Garillot ("huitseeker") <francois [at] garillot.net>
# Contributor: Christian Krause ("wookietreiber") <kizkizzbangbang@gmail.com>
# Contributor: Emanuel Fontelles ("emanuelfontelles") <emanuelfontelles@hotmail.com>

pkgname=apache-spark
pkgver=2.4.3
pkgrel=1
pkgdesc="fast and general engine for large-scale data processing"
arch=('any')
url="http://spark.apache.org"
license=('APACHE')
depends=('java-environment>=6' 'java-environment<9')
optdepends=('python2: python2 support for pyspark'
            'ipython2: ipython2 support for pyspark'
            'python: python3 support for pyspark'
            'ipython: ipython3 support for pyspark'
            'r: support for sparkR'
            'rsync: support rsync hadoop binaries from master'
            'hadoop: support for running on YARN')

install=apache-spark.install
source=("https://archive.apache.org/dist/spark/spark-${pkgver}/spark-${pkgver}-bin-hadoop2.7.tgz"
        'apache-spark-master.service'
        'apache-spark-slave@.service'
        'spark-env.sh'
        'spark-daemon-run.sh'
        'run-master.sh'
        'run-slave.sh')
sha1sums=('7b2f1be5c4ccec86c6d2b1e54c379b7af7a5752a'
          'ac71d12070a9a10323e8ec5aed4346b1dd7f21c6'
          'a191e4f8f7f8bbc596f4fadfb3c592c3efbc4fc0'
          '3fa39d55075d4728bd447692d648053c9f6b07ec'
          '08557d2d5328d5c99e533e16366fd893fffaad78'
          '323445b8d64aea0534a2213d2600d438f406855b'
          '65b1bc5fce63d1fa7a1b90f2d54a09acf62012a4')
backup=('etc/apache-spark/spark-env.sh')

PKGEXT=${PKGEXT:-'.pkg.tar.xz'}

prepare() {
  cd "$srcdir/spark-${pkgver}-bin-hadoop2.7"
}

package() {
        cd "$srcdir/spark-${pkgver}-bin-hadoop2.7"

        install -d "$pkgdir/usr/bin" "$pkgdir/opt" "$pkgdir/var/log/apache-spark" "$pkgdir/var/lib/apache-spark/work"
        chmod 2775 "$pkgdir/var/log/apache-spark" "$pkgdir/var/lib/apache-spark/work"

        cp -r "$srcdir/spark-${pkgver}-bin-hadoop2.7" "$pkgdir/opt/apache-spark/"

        cd "$pkgdir/usr/bin"
        for binary in beeline pyspark sparkR spark-class spark-shell find-spark-home spark-sql spark-submit load-spark-env.sh; do
                binpath="/opt/apache-spark/bin/$binary"
                ln -s "$binpath" "$binary"
                sed -i 's|^export SPARK_HOME=.*$|export SPARK_HOME=/opt/apache-spark|' "$pkgdir/$binpath"
                sed -i -Ee 's/\$\(dirname "\$0"\)/$(dirname "$(readlink -f "$0")")/g' "$pkgdir/$binpath"
        done

        mkdir -p "$pkgdir/etc/profile.d"
        echo '#!/bin/sh' > "$pkgdir/etc/profile.d/apache-spark.sh"
        echo 'SPARK_HOME=/opt/apache-spark' >> "$pkgdir/etc/profile.d/apache-spark.sh"
        echo 'export SPARK_HOME' >> "$pkgdir/etc/profile.d/apache-spark.sh"
        chmod 755 "$pkgdir/etc/profile.d/apache-spark.sh"

        install -Dm644 "$srcdir/apache-spark-master.service" "$pkgdir/usr/lib/systemd/system/apache-spark-master.service"
        install -Dm644 "$srcdir/apache-spark-slave@.service" "$pkgdir/usr/lib/systemd/system/apache-spark-slave@.service"
        install -Dm644 "$srcdir/spark-env.sh" "$pkgdir/etc/apache-spark/spark-env.sh"
        for script in run-master.sh run-slave.sh spark-daemon-run.sh; do
            install -Dm755 "$srcdir/$script" "$pkgdir/opt/apache-spark/sbin/$script"
        done
        install -Dm644 "$srcdir/spark-${pkgver}-bin-hadoop2.7/conf"/* "$pkgdir/etc/apache-spark"

        cd "$pkgdir/opt/apache-spark"
        mv conf conf-templates
        ln -sf "/etc/apache-spark" conf
        ln -sf "/var/lib/apache-spark/work" .
}
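
If your mirror serves a tarball that no longer matches the sha1sums above, the checksums can be regenerated before building (updpkgsums is in the pacman-contrib package):

updpkgsums   # rewrites the checksum array from the downloaded sources
makepkg -si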

marcinn commented on 2019-06-14 20:46 (UTC) (edited on 2019-06-14 20:46 (UTC) by marcinn)

PKGBUILD patch for 2.4.3:

diff --git a/PKGBUILD b/PKGBUILD
index 3540d3e..e31fec9 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -2,7 +2,7 @@
 # Contributor: Christian Krause ("wookietreiber") <kizkizzbangbang@gmail.com>

 pkgname=apache-spark
-pkgver=2.4.0
+pkgver=2.4.3
 pkgrel=1
 pkgdesc="fast and general engine for large-scale data processing"
 arch=('any')
@@ -17,14 +17,14 @@ optdepends=('python2: python2 support for pyspark'
             'rsync: support rsync hadoop binaries from master'
             'hadoop: support for running on YARN')
 install=apache-spark.install
-source=("https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-${pkgver}/spark-${pkgver}-bin-without-hadoop.tgz"
+source=("https://archive.apache.org/dist/spark/spark-${pkgver}/spark-${pkgver}-bin-without-hadoop.tgz"
         'apache-spark-master.service'
         'apache-spark-slave@.service'
         'spark-env.sh'
         'spark-daemon-run.sh'
         'run-master.sh'
         'run-slave.sh')
-sha1sums=('ce6fe98272b78a5c487d16f0f5f828908b65d7fe'
+sha1sums=('54bf6a19eb832dc0cf2d7a7465b785390d00122b'
           'ac71d12070a9a10323e8ec5aed4346b1dd7f21c6'
           'a191e4f8f7f8bbc596f4fadfb3c592c3efbc4fc0'
           '3fa39d55075d4728bd447692d648053c9f6b07ec'
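
A sketch of applying the diff to a checkout of this AUR repo (the patch filename here is hypothetical):

cd apache-spark         # clone of https://aur.archlinux.org/apache-spark.git
git apply 2.4.3.patch   # hypothetical name for the diff above
makepkg -si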

sabitmaulanaa commented on 2019-05-27 14:23 (UTC)

Will this package be updated?

lukaszimmermann commented on 2019-04-26 09:29 (UTC)

Works, thanks!

blurbdust commented on 2019-04-26 00:31 (UTC)

Here is my current PKGBUILD, which works for the latest Spark as of 2019-04-23:

https://gist.github.com/blurbdust/3e03531d890c8bcf1da3c3d1192ce4d5

lukaszimmermann commented on 2019-04-25 09:14 (UTC)

I would be happy to become maintainer of this package, then I can take care of some open issues here.