Guide to static functions for Apache Spark v2.3.4

Jean-Georges Perrin

MANNING

Save 50% on this book – eBook, pBook, and MEAP. Enter mesias50 in the Promotional Code box when you checkout. Only at manning.com.

Spark in Action, Second Edition, by Jean-Georges Perrin

ISBN 9781617295522 · 565 pages · $47.99


Guide to static functions for Apache Spark v2.3.4

Jean-Georges Perrin

Copyright 2019 Manning Publications

To pre-order or learn more about these books go to www.manning.com


For online information and ordering of these and other Manning books, please visit www.manning.com. The publisher offers discounts on these books when ordered in quantity.

For more information, please contact

Special Sales Department
Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964
Email: Erin Twohey, [email protected]

©2019 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 761
Shelter Island, NY 11964

Cover designer: Leslie Haimes

ISBN: 9781617297885
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 - EBM - 24 23 22 21 20 19


contents

Static functions ease your transformations 1

1.1 Functions per category 2

Popular functions 2

Aggregate functions 2

Arithmetical functions 3

Array manipulation functions 3

Binary operations 3

Comparison functions 3

Compute function 3

Conditional operations 3

Conversion functions 3

Data shape functions 3

Date and time functions 4

Digest functions 4

Encoding functions 4

Formatting functions 4

JSON (JavaScript object notation) functions 4

List functions 4

Mathematical functions 4

Navigation functions 5

Rounding functions 5

Sorting functions 5


Statistical functions 5

Streaming functions 5

String functions 5

Technical functions 5

Trigonometry functions 6

UDFs (user-defined functions) helpers 6

Validation functions 6

Deprecated functions 6

1.2 Functions appearance per version of Spark 6

Functions appeared in Spark v2.3.0 6

Functions appeared in Spark v2.2.0 6

Functions appeared in Spark v2.1.0 6

Functions appeared in Spark v2.0.0 7

Functions appeared in Spark v1.6.0 7

Functions appeared in Spark v1.5.0 7

Functions appeared in Spark v1.4.0 7

Functions appeared in Spark v1.3.0 7

1.3 Reference for functions 7

abs(Column e) 8

acos(Column e) 8

acos(String columnName) 8

add_months(Column startDate, int numMonths) 8

approxCountDistinct(Column e) 8

approxCountDistinct(Column e, double rsd) 9

approxCountDistinct(String columnName) 9

approxCountDistinct(String columnName, double rsd) 9

approx_count_distinct(Column e) 10

approx_count_distinct(Column e, double rsd) 10

approx_count_distinct(String columnName) 10

approx_count_distinct(String columnName, double rsd) 10

array(Column... cols) 10

array(String colName, String... colNames) 11

array(String colName, scala.collection.Seq<String> colNames) 11

array(scala.collection.Seq<Column> cols) 11


array_contains(Column column, Object value) 11

asc(String columnName) 12

asc_nulls_first(String columnName) 12

asc_nulls_last(String columnName) 12

ascii(Column e) 12

asin(Column e) 13

asin(String columnName) 13

atan(Column e) 13

atan(String columnName) 13

atan2(Column y, Column x) 13

atan2(Column y, String xName) 14

atan2(Column y, double xValue) 14

atan2(String yName, Column x) 14

atan2(String yName, String xName) 15

atan2(String yName, double xValue) 15

atan2(double yValue, Column x) 15

atan2(double yValue, String xName) 16

avg(Column e) 16

avg(String columnName) 16

base64(Column e) 16

bin(Column e) 16

bin(String columnName) 17

bitwiseNOT(Column e) 17

broadcast(Dataset<T> df) 17

bround(Column e) 17

bround(Column e, int scale) 18

callUDF(String udfName, Column... cols) 18

callUDF(String udfName, scala.collection.Seq<Column> cols) 18

cbrt(Column e) 19

cbrt(String columnName) 19

ceil(Column e) 19

ceil(String columnName) 19

coalesce(Column... e) 19

coalesce(scala.collection.Seq<Column> e) 20


col(String colName) 20

collect_list(Column e) 20

collect_list(String columnName) 20

collect_set(Column e) 20

collect_set(String columnName) 21

column(String colName) 21

concat(Column... exprs) 21

concat(scala.collection.Seq<Column> exprs) 21

concat_ws(String sep, Column... exprs) 21

concat_ws(String sep, scala.collection.Seq<Column> exprs) 22

conv(Column num, int fromBase, int toBase) 22

corr(Column column1, Column column2) 22

corr(String columnName1, String columnName2) 23

cos(Column e) 23

cos(String columnName) 23

cosh(Column e) 23

cosh(String columnName) 23

count(Column e) 24

count(String columnName) 24

countDistinct(Column expr, Column... exprs) 24

countDistinct(Column expr, scala.collection.Seq<Column> exprs) 24

countDistinct(String columnName, String... columnNames) 24

countDistinct(String columnName, scala.collection.Seq<String> columnNames) 25

covar_pop(Column column1, Column column2) 25

covar_pop(String columnName1, String columnName2) 25

covar_samp(Column column1, Column column2) 26

covar_samp(String columnName1, String columnName2) 26

crc32(Column e) 26

cume_dist() 26

currentRow() 27

current_date() 27

current_timestamp() 27


date_add(Column start, int days) 27

date_format(Column dateExpr, String format) 27

date_sub(Column start, int days) 28

date_trunc(String format, Column timestamp) 28

datediff(Column end, Column start) 28

dayofmonth(Column e) 29

dayofweek(Column e) 29

dayofyear(Column e) 29

decode(Column value, String charset) 29

degrees(Column e) 29

degrees(String columnName) 30

dense_rank() 30

desc(String columnName) 30

desc_nulls_first(String columnName) 30

desc_nulls_last(String columnName) 31

encode(Column value, String charset) 31

exp(Column e) 31

exp(String columnName) 31

explode(Column e) 32

explode_outer(Column e) 32

expm1(Column e) 32

expm1(String columnName) 32

expr(String expr) 32

factorial(Column e) 33

first(Column e) 33

first(Column e, boolean ignoreNulls) 33

first(String columnName) 33

first(String columnName, boolean ignoreNulls) 34

floor(Column e) 34

floor(String columnName) 34

format_number(Column x, int d) 34

format_string(String format, Column... arguments) 35

format_string(String format, scala.collection.Seq<Column> arguments) 35


from_json(Column e, DataType schema) 35

from_json(Column e, DataType schema, java.util.Map<String,String> options) 35

from_json(Column e, DataType schema, scala.collection.immutable.Map<String,String> options) 36

from_json(Column e, String schema, java.util.Map<String,String> options) 36

from_json(Column e, String schema, scala.collection.immutable.Map<String,String> options) 37

from_json(Column e, StructType schema) 37

from_json(Column e, StructType schema, java.util.Map<String,String> options) 37

from_json(Column e, StructType schema, scala.collection.immutable.Map<String,String> options) 38

from_unixtime(Column ut) 38

from_unixtime(Column ut, String f) 38

from_utc_timestamp(Column ts, String tz) 38

get_json_object(Column e, String path) 39

greatest(Column... exprs) 39

greatest(String columnName, String... columnNames) 39

greatest(String columnName, scala.collection.Seq<String> columnNames) 40

greatest(scala.collection.Seq<Column> exprs) 40

grouping(Column e) 40

grouping(String columnName) 40

grouping_id(String colName, scala.collection.Seq<String> colNames) 41

grouping_id(scala.collection.Seq<Column> cols) 41

hash(Column... cols) 41

hash(scala.collection.Seq<Column> cols) 41

hex(Column column) 41

hour(Column e) 42

hypot(Column l, Column r) 42

hypot(Column l, String rightName) 42

hypot(Column l, double r) 42


hypot(String leftName, Column r) 43

hypot(String leftName, String rightName) 43

hypot(String leftName, double r) 43

hypot(double l, Column r) 43

hypot(double l, String rightName) 44

initcap(Column e) 44

input_file_name() 44

instr(Column str, String substring) 44

isnan(Column e) 45

isnull(Column e) 45

json_tuple(Column json, String... fields) 45

json_tuple(Column json, scala.collection.Seq<String> fields) 45

kurtosis(Column e) 45

kurtosis(String columnName) 46

lag(Column e, int offset) 46

lag(Column e, int offset, Object defaultValue) 46

lag(String columnName, int offset) 47

lag(String columnName, int offset, Object defaultValue) 47

last(Column e) 47

last(Column e, boolean ignoreNulls) 48

last(String columnName) 48

last(String columnName, boolean ignoreNulls) 48

last_day(Column e) 48

lead(Column e, int offset) 49

lead(Column e, int offset, Object defaultValue) 49

lead(String columnName, int offset) 49

lead(String columnName, int offset, Object defaultValue) 50

least(Column... exprs) 50

least(String columnName, String... columnNames) 50

least(String columnName, scala.collection.Seq<String> columnNames) 51

least(scala.collection.Seq<Column> exprs) 51

length(Column e) 51


levenshtein(Column l, Column r) 51

lit(Object literal) 52

locate(String substr, Column str) 52

locate(String substr, Column str, int pos) 52

log(Column e) 52

log(String columnName) 53

log(double base, Column a) 53

log(double base, String columnName) 53

log10(Column e) 53

log10(String columnName) 53

log1p(Column e) 54

log1p(String columnName) 54

log2(Column expr) 54

log2(String columnName) 54

lower(Column e) 54

lpad(Column str, int len, String pad) 55

ltrim(Column e) 55

ltrim(Column e, String trimString) 55

map(Column... cols) 55

map(scala.collection.Seq<Column> cols) 56

map_keys(Column e) 56

map_values(Column e) 56

max(Column e) 56

max(String columnName) 56

md5(Column e) 57

mean(Column e) 57

mean(String columnName) 57

min(Column e) 57

min(String columnName) 57

minute(Column e) 58

monotonicallyIncreasingId() 58

monotonically_increasing_id() 58

month(Column e) 59


months_between(Column date1, Column date2) 59

nanvl(Column col1, Column col2) 59

negate(Column e) 59

next_day(Column date, String dayOfWeek) 60

not(Column e) 60

ntile(int n) 60

percent_rank() 61

pmod(Column dividend, Column divisor) 61

posexplode(Column e) 61

posexplode_outer(Column e) 61

pow(Column l, Column r) 62

pow(Column l, String rightName) 62

pow(Column l, double r) 62

pow(String leftName, Column r) 62

pow(String leftName, String rightName) 63

pow(String leftName, double r) 63

pow(double l, Column r) 63

pow(double l, String rightName) 63

quarter(Column e) 64

radians(Column e) 64

radians(String columnName) 64

rand() 64

rand(long seed) 64

randn() 65

randn(long seed) 65

rank() 65

regexp_extract(Column e, String exp, int groupIdx) 65

regexp_replace(Column e, Column pattern, Column replacement) 66

regexp_replace(Column e, String pattern, String replacement) 66

repeat(Column str, int n) 66

reverse(Column str) 66

rint(Column e) 67

rint(String columnName) 67


round(Column e) 67

round(Column e, int scale) 67

row_number() 68

rpad(Column str, int len, String pad) 68

rtrim(Column e) 68

rtrim(Column e, String trimString) 68

second(Column e) 68

sha1(Column e) 69

sha2(Column e, int numBits) 69

shiftLeft(Column e, int numBits) 69

shiftRight(Column e, int numBits) 69

shiftRightUnsigned(Column e, int numBits) 70

signum(Column e) 70

signum(String columnName) 70

sin(Column e) 70

sin(String columnName) 71

sinh(Column e) 71

sinh(String columnName) 71

size(Column e) 71

skewness(Column e) 71

skewness(String columnName) 71

sort_array(Column e) 72

sort_array(Column e, boolean asc) 72

soundex(Column e) 72

spark_partition_id() 72

split(Column str, String pattern) 73

sqrt(Column e) 73

sqrt(String colName) 73

stddev(Column e) 73

stddev(String columnName) 73

stddev_pop(Column e) 74

stddev_pop(String columnName) 74

stddev_samp(Column e) 74


stddev_samp(String columnName) 74

struct(Column... cols) 74

struct(String colName, String... colNames) 75

struct(String colName, scala.collection.Seq<String> colNames) 75

struct(scala.collection.Seq<Column> cols) 75

substring(Column str, int pos, int len) 75

substring_index(Column str, String delim, int count) 76

sum(Column e) 76

sum(String columnName) 76

sumDistinct(Column e) 76

sumDistinct(String columnName) 77

tan(Column e) 77

tan(String columnName) 77

tanh(Column e) 77

tanh(String columnName) 77

toDegrees(Column e) 78

toDegrees(String columnName) 78

toRadians(Column e) 78

toRadians(String columnName) 78

to_date(Column e) 78

to_date(Column e, String fmt) 79

to_json(Column e) 79

to_json(Column e, java.util.Map<String,String> options) 79

to_json(Column e, scala.collection.immutable.Map<String,String> options) 79

to_timestamp(Column s) 80

to_timestamp(Column s, String fmt) 80

to_utc_timestamp(Column ts, String tz) 80

translate(Column src, String matchingString, String replaceString) 81

trim(Column e) 81

trim(Column e, String trimString) 81

trunc(Column date, String format) 81


typedLit(T literal, scala.reflect.api.TypeTags.TypeTag<T> evidence$1) 82

udf(Object f, DataType dataType) 82

udf(UDF0<?> f, DataType returnType) 82

udf(UDF10<?,?,?,?,?,?,?,?,?,?,?> f, DataType returnType) 83

udf(UDF1<?,?> f, DataType returnType) 83

udf(UDF2<?,?,?> f, DataType returnType) 83

udf(UDF3<?,?,?,?> f, DataType returnType) 84

udf(UDF4<?,?,?,?,?> f, DataType returnType) 84

udf(UDF5<?,?,?,?,?,?> f, DataType returnType) 84

udf(UDF6<?,?,?,?,?,?,?> f, DataType returnType) 85

udf(UDF7<?,?,?,?,?,?,?,?> f, DataType returnType) 85

udf(UDF8<?,?,?,?,?,?,?,?,?> f, DataType returnType) 85

udf(UDF9<?,?,?,?,?,?,?,?,?,?> f, DataType returnType) 86

udf(scala.Function0<RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$2) 86

udf(scala.Function10<A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$57, scala.reflect.api.TypeTags.TypeTag<A1> evidence$58, scala.reflect.api.TypeTags.TypeTag<A2> evidence$59, scala.reflect.api.TypeTags.TypeTag<A3> evidence$60, scala.reflect.api.TypeTags.TypeTag<A4> evidence$61, scala.reflect.api.TypeTags.TypeTag<A5> evidence$62, scala.reflect.api.TypeTags.TypeTag<A6> evidence$63, scala.reflect.api.TypeTags.TypeTag<A7> evidence$64, scala.reflect.api.TypeTags.TypeTag<A8> evidence$65, scala.reflect.api.TypeTags.TypeTag<A9> evidence$66, scala.reflect.api.TypeTags.TypeTag<A10> evidence$67) 87

udf(scala.Function1<A1,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$3, scala.reflect.api.TypeTags.TypeTag<A1> evidence$4) 88

udf(scala.Function2<A1,A2,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$5, scala.reflect.api.TypeTags.TypeTag<A1> evidence$6, scala.reflect.api.TypeTags.TypeTag<A2> evidence$7) 88

udf(scala.Function3<A1,A2,A3,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$8, scala.reflect.api.TypeTags.TypeTag<A1> evidence$9, scala.reflect.api.TypeTags.TypeTag<A2> evidence$10, scala.reflect.api.TypeTags.TypeTag<A3> evidence$11) 89


udf(scala.Function4<A1,A2,A3,A4,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$12, scala.reflect.api.TypeTags.TypeTag<A1> evidence$13, scala.reflect.api.TypeTags.TypeTag<A2> evidence$14, scala.reflect.api.TypeTags.TypeTag<A3> evidence$15, scala.reflect.api.TypeTags.TypeTag<A4> evidence$16) 89

udf(scala.Function5<A1,A2,A3,A4,A5,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$17, scala.reflect.api.TypeTags.TypeTag<A1> evidence$18, scala.reflect.api.TypeTags.TypeTag<A2> evidence$19, scala.reflect.api.TypeTags.TypeTag<A3> evidence$20, scala.reflect.api.TypeTags.TypeTag<A4> evidence$21, scala.reflect.api.TypeTags.TypeTag<A5> evidence$22) 90

udf(scala.Function6<A1,A2,A3,A4,A5,A6,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$23, scala.reflect.api.TypeTags.TypeTag<A1> evidence$24, scala.reflect.api.TypeTags.TypeTag<A2> evidence$25, scala.reflect.api.TypeTags.TypeTag<A3> evidence$26, scala.reflect.api.TypeTags.TypeTag<A4> evidence$27, scala.reflect.api.TypeTags.TypeTag<A5> evidence$28, scala.reflect.api.TypeTags.TypeTag<A6> evidence$29) 91

udf(scala.Function7<A1,A2,A3,A4,A5,A6,A7,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$30, scala.reflect.api.TypeTags.TypeTag<A1> evidence$31, scala.reflect.api.TypeTags.TypeTag<A2> evidence$32, scala.reflect.api.TypeTags.TypeTag<A3> evidence$33, scala.reflect.api.TypeTags.TypeTag<A4> evidence$34, scala.reflect.api.TypeTags.TypeTag<A5> evidence$35, scala.reflect.api.TypeTags.TypeTag<A6> evidence$36, scala.reflect.api.TypeTags.TypeTag<A7> evidence$37) 91

udf(scala.Function8<A1,A2,A3,A4,A5,A6,A7,A8,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$38, scala.reflect.api.TypeTags.TypeTag<A1> evidence$39, scala.reflect.api.TypeTags.TypeTag<A2> evidence$40, scala.reflect.api.TypeTags.TypeTag<A3> evidence$41, scala.reflect.api.TypeTags.TypeTag<A4> evidence$42, scala.reflect.api.TypeTags.TypeTag<A5> evidence$43, scala.reflect.api.TypeTags.TypeTag<A6> evidence$44, scala.reflect.api.TypeTags.TypeTag<A7> evidence$45, scala.reflect.api.TypeTags.TypeTag<A8> evidence$46) 92

udf(scala.Function9&lt;A1,A2,A3,A4,A5,A6,A7,A8,A9,RT&gt; f, scala.reflect.api.TypeTags.TypeTag&lt;RT&gt; evidence$47, scala.reflect.api.TypeTags.TypeTag&lt;A1&gt; evidence$48, scala.reflect.api.TypeTags.TypeTag&lt;A2&gt; evidence$49, scala.reflect.api.TypeTags.TypeTag&lt;A3&gt; evidence$50, scala.reflect.api.TypeTags.TypeTag&lt;A4&gt; evidence$51, scala.reflect.api.TypeTags.TypeTag&lt;A5&gt; evidence$52, scala.reflect.api.TypeTags.TypeTag&lt;A6&gt; evidence$53, scala.reflect.api.TypeTags.TypeTag&lt;A7&gt; evidence$54, scala.reflect.api.TypeTags.TypeTag&lt;A8&gt; evidence$55, scala.reflect.api.TypeTags.TypeTag&lt;A9&gt; evidence$56) 93

unbase64(Column e) 94

unboundedFollowing() 94

unboundedPreceding() 95

unhex(Column column) 95

unix_timestamp() 95

unix_timestamp(Column s) 95

unix_timestamp(Column s, String p) 95

upper(Column e) 96

var_pop(Column e) 96

var_pop(String columnName) 96

var_samp(Column e) 96

var_samp(String columnName) 96

variance(Column e) 97

variance(String columnName) 97

weekofyear(Column e) 97

when(Column condition, Object value) 97

window(Column timeColumn, String windowDuration) 98

window(Column timeColumn, String windowDuration, String slideDuration) 98

window(Column timeColumn, String windowDuration, String slideDuration, String startTime) 99

year(Column e) 100


Chapter 1

Static functions ease your transformations

Static functions are a fantastic help when you are performing transformations. They help you transform your data within the dataframe.
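For example, a minimal Java sketch (assuming df is an existing dataframe with hypothetical first_name and last_name string columns):

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// col(), lit(), and concat() are static functions from org.apache.spark.sql.functions
Dataset&lt;Row&gt; fullNameDf = df.withColumn(
    "full_name",
    concat(col("first_name"), lit(" "), col("last_name")));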

This guide is designed as a comprehensive reference to be used to find the functions you will need.

The first part contains the list of functions per category and the second part contains the definition of each function, like in a JavaDoc.

This guide is specific to Apache Spark version 2.3.4. Specific guides for other versions are available on Manning’s website.

There are 348 functions. I’ve classified them in the following categories:

- Popular functions: frequently used functions.
- Aggregate functions: perform data aggregations.
- Arithmetical functions: perform simple and complex arithmetical operations.
- Array manipulation functions: perform array operations.
- Binary operations: perform binary-level operations.
- Comparison functions: perform comparisons.
- Compute function: perform computation from a SQL-like statement.
- Conditional operations: perform conditional evaluations.
- Conversion functions: perform data and type conversions.
- Data shape functions: perform operations relating to modifying the shape of the data.
- Date and time functions: perform date and time manipulations and conversions.
- Digest functions: calculate digests on columns.
- Encoding functions: perform encoding/decoding.
- Formatting functions: perform string and number formatting.
- JSON (JavaScript object notation) functions: transform to and from JSON documents and fragments.
- List functions: perform data collection operations on lists.
- Mathematical functions: perform mathematical operations on columns. Check out the mathematics subcategories as well: trigonometry, arithmetics, and statistics.
- Navigation functions: allow referencing of columns.
- Rounding functions: perform rounding operations on numerical values.
- Sorting functions: perform column sorting.
- Statistical functions: perform statistical operations.
- Streaming functions: perform window/streaming operations.
- String functions: perform common string operations.
- Technical functions: inform on dataframe technical/meta information.
- Trigonometry functions: perform trigonometric calculations.
- UDFs (user-defined functions) helpers: provide help with manipulating UDFs.
- Validation functions: perform value type validation.
- Deprecated functions.

1.1 Functions per category

This section lists all the functions, per category. Some functions can be in several categories, which is typically the case for mathematical functions, which are subdivided into arithmetic, trigonometry, and more. Functions are listed in each category and subcategory they belong to, so they may appear several times.

1.1.1 Popular functions

These functions are very popular. Popularity is admittedly subjective: these are the functions my teams and I use a lot, and the ones most frequently asked about on Stack Overflow.

There are six functions in this category: col(), concat(), expr(), lit(), split(), and to_date().
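As an illustration, here is a sketch exercising all six (assuming a dataframe df with hypothetical columns name, csv_tags, and birth_date):

import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; df2 = df
    .withColumn("label", concat(lit("user: "), col("name")))  // lit(), concat(), col()
    .withColumn("tags", split(col("csv_tags"), ","))          // split() on a regex
    .withColumn("born", to_date(col("birth_date")))           // to_date()
    .withColumn("name_len", expr("length(name)"));            // expr(): SQL-like expression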

1.1.2 Aggregate functions

Aggregate functions allow you to perform a calculation on a set of values and return a single scalar value. In SQL, developers often use aggregate functions with the GROUP BY and HAVING clauses of SELECT statements.

There are 25 functions in this category: approx_count_distinct(), collect_list(), collect_set(), corr(), count(), countDistinct(), covar_pop(), covar_samp(), first(), grouping(), grouping_id(), kurtosis(), last(), max(), mean(), min(), skewness(), stddev(), stddev_pop(), stddev_samp(), sum(), sumDistinct(), var_pop(), var_samp(), and variance().
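A typical use groups rows and aggregates each group; a minimal sketch (assuming a dataframe df with hypothetical dept and salary columns):

import static org.apache.spark.sql.functions.*;

// One output row per department, with several aggregates per group
Dataset&lt;Row&gt; stats = df.groupBy(col("dept")).agg(
    count(col("salary")).alias("n"),
    mean(col("salary")).alias("mean_salary"),
    min(col("salary")).alias("min_salary"),
    max(col("salary")).alias("max_salary"),
    approx_count_distinct(col("salary")).alias("distinct_salaries"));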


1.1.3 Arithmetical functions

Arithmetical functions perform operations like computing square roots.

There are 13 functions in this category: cbrt(), exp(), expm1(), factorial(), hypot(), log(), log10(), log1p(), log2(), negate(), pmod(), pow(), and sqrt().

1.1.4 Array manipulation functions

Array functions manipulate arrays when they are in a dataframe’s cell.

There are five functions in this category: array(), array_contains(), reverse(), size(), and sort_array().

1.1.5 Binary operations

Thanks to binary functions, you can perform binary-level operations, like binary not, shifting bits, and similar operations.

There are five functions in this category: bitwiseNOT(), not(), shiftLeft(), shiftRight(), and shiftRightUnsigned().

1.1.6 Comparison functions

Comparison functions are used to compare values. There are two functions in this category: greatest() and least().

1.1.7 Compute function

This function is used to compute values from a statement. The statement itself is SQL-like.

There is one function in this category: expr().

1.1.8 Conditional operations

Conditional functions are used to evaluate values on a conditional basis. There are two functions in this category: nanvl() and when().
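A short sketch (assuming a dataframe df with hypothetical age and ratio columns):

import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; df2 = df
    .withColumn("category",
        when(col("age").lt(18), "minor").otherwise("adult"))  // conditional value
    .withColumn("safe_ratio", nanvl(col("ratio"), lit(0.0))); // falls back to 0.0 when ratio is NaN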

1.1.9 Conversion functions

Conversion functions are used for converting various data into other types: date, JSON, hexadecimal, and more.

There are 12 functions in this category: conv(), date_format(), from_json(), from_unixtime(), from_utc_timestamp(), get_json_object(), hex(), to_date(), to_json(), to_timestamp(), to_utc_timestamp(), and unhex().

1.1.10 Data shape functions

These functions modify the data shape like creating a column with a literal value (lit()), flattening, mapping, and more.

There are 12 functions in this category: coalesce(), explode(), explode_outer(), lit(), map(), map_keys(), map_values(), monotonically_increasing_id(), posexplode(), posexplode_outer(), struct(), and typedLit().
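For instance (a sketch, assuming a dataframe df with a hypothetical array column tags):

import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; df2 = df
    .withColumn("source", lit("catalog"))                  // constant column
    .withColumn("tag", explode(col("tags")))               // one row per array element
    .withColumn("row_id", monotonically_increasing_id());  // unique (not consecutive) ids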


1.1.11 Date and time functions

Date and time functions manipulate dates, time, and their combinations, like finding the current date (current_date()), adding days/months/years to a date, and more.

There are 28 functions in this category: add_months(), current_date(), current_timestamp(), date_add(), date_format(), date_sub(), date_trunc(), datediff(), dayofmonth(), dayofweek(), dayofyear(), from_unixtime(), from_utc_timestamp(), hour(), last_day(), minute(), month(), months_between(), next_day(), quarter(), second(), to_date(), to_timestamp(), to_utc_timestamp(), trunc(), unix_timestamp(), weekofyear(), and year().
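A minimal sketch (assuming a dataframe df with a hypothetical string column event_date in yyyy-MM-dd format):

import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; df2 = df
    .withColumn("event", to_date(col("event_date"), "yyyy-MM-dd"))
    .withColumn("next_week", date_add(col("event"), 7))
    .withColumn("event_month", month(col("event")))
    .withColumn("age_in_days", datediff(current_date(), col("event")));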

1.1.12 Digest functions

Digest functions create digests from values in other columns. Digests can be MD5 (md5()), SHA1/2, and more.

There are seven functions in this category: base64(), crc32(), hash(), md5(), sha1(), sha2(), and unbase64().

1.1.13 Encoding functions

Encoding functions can manipulate encodings. There are three functions in this category: base64(), decode(), and encode().

1.1.14 Formatting functions

Formatting functions format strings and numbers in a specified way. There are two functions in this category: format_number() and format_string().

1.1.15 JSON (JavaScript object notation) functions

JSON functions help with conversions to and from JSON, and with JSON manipulation.

There are four functions in this category: from_json(), get_json_object(), json_tuple(), and to_json().

1.1.16 List functions

With list functions, you can manipulate lists built by collecting data. The meaning of the collected data is based on the dataset/dataframe’s collect() method; chapter 16 explains collect() and collectAsList().

There are two functions in this category: collect_list() and collect_set().
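A sketch (assuming a dataframe df with hypothetical dept and employee columns):

import static org.apache.spark.sql.functions.*;

Dataset&lt;Row&gt; byDept = df.groupBy(col("dept")).agg(
    collect_list(col("employee")).alias("employees"),        // keeps duplicates
    collect_set(col("employee")).alias("unique_employees")); // removes duplicates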

1.1.17 Mathematical functions

The range of mathematical functions is broad, with subcategories in trigonometry, arithmetic, statistics, and more. They usually behave like their java.lang.Math counterparts.

There are 37 functions in this category: abs(), acos(), asin(), atan(), atan2(), avg(), bround(), cbrt(), ceil(), cos(), cosh(), covar_pop(), covar_samp(), degrees(), exp(), expm1(), factorial(), floor(), hypot(), log(), log10(), log1p(), log2(), negate(), pmod(), pow(), radians(), rand(), randn(), rint(), round(), signum(), sin(), sinh(), sqrt(), tan(), and tanh().


1.1.18 Navigation functions

Navigation functions perform navigation or referencing within the dataframe. There are four functions in this category: col(), column(), first(), and last().

1.1.19 Rounding functions

Rounding functions perform rounding of numerical values.

There are five functions in this category: bround(), ceil(), floor(), rint(), and round().

1.1.20 Sorting functions

Sorting functions are used for sorting of elements within a column.

There are 11 functions in this category: asc(), asc_nulls_first(), asc_nulls_last(), desc(), desc_nulls_first(), desc_nulls_last(), greatest(), least(), max(), min(), and sort_array().

1.1.21 Statistical functions

Statistical functions cover statistics like calculating averages, variances, and more. They are often used in the context of window/streaming or aggregates.

There are 11 functions in this category: avg(), covar_pop(), covar_samp(), cume_dist(), mean(), stddev(), stddev_pop(), stddev_samp(), var_pop(), var_samp(), and variance().

1.1.22 Streaming functions

Streaming functions are used in the context of window/streaming operations.

There are nine functions in this category: cume_dist(), dense_rank(), lag(), lead(), ntile(), percent_rank(), rank(), row_number(), and window().
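Most of these are applied over a window specification; a sketch (assuming a dataframe df with hypothetical dept, ts, and value columns):

import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.*;

WindowSpec w = Window.partitionBy("dept").orderBy("ts");
Dataset&lt;Row&gt; df2 = df
    .withColumn("prev_value", lag(col("value"), 1).over(w))  // previous row in the window
    .withColumn("rank_in_dept", rank().over(w));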

1.1.23 String functions

String functions allow manipulation of strings, like concatenation, extraction and replacement based on regex, and more.

There are 30 functions in this category: ascii(), bin(), concat(), concat_ws(), date_format(), date_trunc(), format_number(), format_string(), get_json_object(), initcap(), instr(), length(), levenshtein(), locate(), lower(), lpad(), ltrim(), regexp_extract(), regexp_replace(), repeat(), reverse(), rpad(), rtrim(), soundex(), split(), substring(), substring_index(), translate(), trim(), and upper().

1.1.24 Technical functions

Technical functions give you meta information on the dataframe and its structure.

There are five functions in this category: broadcast(), col(), column(), input_file_name(), and spark_partition_id().


1.1.25 Trigonometry functions

Trigonometry functions perform operations such as sine, cosine, and more.

There are 12 functions in this category: acos(), asin(), atan(), atan2(), cos(), cosh(), degrees(), radians(), sin(), sinh(), tan(), and tanh().

1.1.26 UDFs (user-defined functions) helpers

UDFs are functions in their own right; they extend Apache Spark. However, to use a UDF in a transformation, you will need these helper functions. Using and building UDFs are covered in chapter 14. The counterpart to UDFs for aggregations is UDAFs (user-defined aggregate functions), detailed in chapter 15.

There are two functions in this category: callUDF() and udf().
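A minimal Java sketch (assuming an existing SparkSession spark and a dataframe df with hypothetical id and value columns; the UDF name square is made up for the example):

import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.*;

// Register a UDF, then invoke it with the callUDF() helper
spark.udf().register("square",
    (UDF1&lt;Integer, Integer&gt;) v -> v * v, DataTypes.IntegerType);
Dataset&lt;Row&gt; df2 = df.select(col("id"), callUDF("square", col("value")));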

1.1.27 Validation functions

Validation functions allow you to test a value’s status, such as whether it is NaN (not a number) or null.

There are two functions in this category: isnan() and isnull().
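For example (a sketch, assuming a dataframe df with a hypothetical double column measure):

import static org.apache.spark.sql.functions.*;

// Keep only rows where measure is neither null nor NaN
Dataset&lt;Row&gt; clean = df
    .filter(not(isnull(col("measure"))))
    .filter(not(isnan(col("measure"))));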

1.1.28 Deprecated functions

These functions are still available, but are deprecated. If you are using them, check their replacement at https://spark.apache.org/docs/2.3.4/api/java/org/apache/spark/sql/functions.html.

There are four functions in this category: approxCountDistinct(), monotonicallyIncreasingId(), toDegrees(), and toRadians().

1.2 Functions appearance per version of Spark

This section lists all the functions in reverse order of appearance per version of Apache Spark.

1.2.1 Functions appeared in Spark v2.3.0

There are 12 functions in this category: currentRow(), date_trunc(), dayofweek(), from_json(), ltrim(), map_keys(), map_values(), rtrim(), trim(), udf(), unboundedFollowing(), and unboundedPreceding().

1.2.2 Functions appeared in Spark v2.2.0

There are six functions in this category: explode_outer(), from_json(), posexplode_outer(), to_date(), to_timestamp(), and typedLit().

1.2.3 Functions appeared in Spark v2.1.0

There are 11 functions in this category: approx_count_distinct(), asc_nulls_first(), asc_nulls_last(), degrees(), desc_nulls_first(), desc_nulls_last(), from_json(), posexplode(), radians(), regexp_replace(), and to_json().


1.2.4 Functions appeared in Spark v2.0.0

There are ten functions in this category: bround(), covar_pop(), covar_samp(), first(), grouping(), grouping_id(), hash(), last(), udf(), and window().

1.2.5 Functions appeared in Spark v1.6.0

There are 22 functions in this category: collect_list(), collect_set(), corr(), cume_dist(), dense_rank(), get_json_object(), input_file_name(), isnan(), isnull(), json_tuple(), kurtosis(), monotonically_increasing_id(), percent_rank(), row_number(), skewness(), spark_partition_id(), stddev(), stddev_pop(), stddev_samp(), var_pop(), var_samp(), and variance().

1.2.6 Functions appeared in Spark v1.5.0

There are 76 functions in this category: add_months(), array_contains(), ascii(), base64(), bin(), broadcast(), callUDF(), concat(), concat_ws(), conv(), crc32(), current_date(), current_timestamp(), date_add(), date_format(), date_sub(), datediff(), dayofmonth(), dayofyear(), decode(), encode(), factorial(), format_number(), format_string(), from_unixtime(), from_utc_timestamp(), greatest(), hex(), hour(), initcap(), instr(), last_day(), least(), length(), levenshtein(), locate(), log2(), lpad(), ltrim(), md5(), minute(), month(), months_between(), nanvl(), next_day(), pmod(), quarter(), regexp_extract(), regexp_replace(), repeat(), reverse(), round(), rpad(), rtrim(), second(), sha1(), sha2(), shiftLeft(), shiftRight(), shiftRightUnsigned(), size(), sort_array(), soundex(), split(), sqrt(), substring(), to_date(), to_utc_timestamp(), translate(), trim(), trunc(), unbase64(), unhex(), unix_timestamp(), weekofyear(), and year().

1.2.7 Functions appeared in Spark v1.4.0

There are 36 functions in this category: acos(), array(), asin(), atan(), atan2(), bitwiseNOT(), cbrt(), ceil(), cos(), cosh(), exp(), expm1(), floor(), hypot(), lag(), lead(), log(), log10(), log1p(), mean(), monotonicallyIncreasingId(), ntile(), pow(), rand(), randn(), rank(), rint(), signum(), sin(), sinh(), struct(), tan(), tanh(), toDegrees(), toRadians(), and when().

1.2.8 Functions appeared in Spark v1.3.0

There are 24 functions in this category: abs(), approxCountDistinct(), asc(), avg(), coalesce(), col(), column(), count(), countDistinct(), desc(), explode(), first(), last(), lit(), lower(), max(), min(), negate(), not(), sqrt(), sum(), sumDistinct(), udf(), and upper().

1.3 Reference for functions

This section lists all the functions in alphabetical order, including their complete signature. Use this as a reference. The online reference can be found at http://jgp.net/functions and https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html.

1.3.1 abs(Column e)

Computes the absolute value. Signature: Column abs(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics.

1.3.2 acos(Column e)

Returns the arc cosine of a value; the returned angle is in the range 0.0 through pi. Signature: Column acos(Column e). Parameter: Column e. Returns: Column inverse cosine of e in radians, as if computed by java.lang.Math.acos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.3 acos(String columnName)

Returns the arc cosine of a value; the returned angle is in the range 0.0 through pi. Signature: Column acos(String columnName). Parameter: String columnName. Returns: Column inverse cosine of columnName, as if computed by java.lang.Math.acos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.4 add_months(Column startDate, int numMonths)

Returns the date that is numMonths after startDate. Signature: Column add_months(Column startDate, int numMonths). Parameters:

Column startDate. int numMonths.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.5 approxCountDistinct(Column e)

Deprecated. Use approx_count_distinct. Since 2.1.0. Signature: Column approxCountDistinct(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. Function has been deprecated in Spark v2.1.0 and is replaced by approx_count_distinct. This method is classified in: deprecated.

1.3.6 approxCountDistinct(Column e, double rsd)

Deprecated. Use approx_count_distinct. Since 2.1.0. Signature: Column approxCountDistinct(Column e, double rsd). Parameters:

Column e. double rsd.

Returns: Column. Appeared in Apache Spark v1.3.0. Function has been deprecated in Spark v2.1.0 and is replaced by approx_count_distinct. This method is classified in: deprecated.

1.3.7 approxCountDistinct(String columnName)

Deprecated. Use approx_count_distinct. Since 2.1.0. Signature: Column approxCountDistinct(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. Function has been deprecated in Spark v2.1.0 and is replaced by approx_count_distinct. This method is classified in: deprecated.

1.3.8 approxCountDistinct(String columnName, double rsd)

Deprecated. Use approx_count_distinct. Since 2.1.0. Signature: Column approxCountDistinct(String columnName, double rsd). Parameters:

String columnName. double rsd.

Returns: Column. Appeared in Apache Spark v1.3.0. Function has been deprecated in Spark v2.1.0 and is replaced by approx_count_distinct. This method is classified in: deprecated.


1.3.9 approx_count_distinct(Column e)

Aggregate function: returns the approximate number of distinct items in a group. Signature: Column approx_count_distinct(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: aggregate.

1.3.10 approx_count_distinct(Column e, double rsd)

Aggregate function: returns the approximate number of distinct items in a group. Signature: Column approx_count_distinct(Column e, double rsd). Parameters:

Column e. double rsd maximum estimation error allowed (default = 0.05).

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: aggregate.
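Usage sketch (assuming a dataframe df with a hypothetical user_id column):

import static org.apache.spark.sql.functions.*;

// Allow up to 1% estimation error in exchange for speed and memory
Dataset&lt;Row&gt; result = df.agg(
    approx_count_distinct(col("user_id"), 0.01).alias("approx_users"));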

1.3.11 approx_count_distinct(String columnName)

Aggregate function: returns the approximate number of distinct items in a group. Signature: Column approx_count_distinct(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: aggregate.

1.3.12 approx_count_distinct(String columnName, double rsd)

Aggregate function: returns the approximate number of distinct items in a group. Signature: Column approx_count_distinct(String columnName, double rsd). Parameters:

String columnName. double rsd maximum estimation error allowed (default = 0.05).

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: aggregate.

1.3.13 array(Column... cols)

Creates a new array column. The input columns must all have the same data type. Signature: Column array(Column... cols). Parameter: Column... cols. Returns: Column.


Appeared in Apache Spark v1.4.0. This method is classified in: array.

1.3.14 array(String colName, String... colNames)

Creates a new array column. The input columns must all have the same data type. Signature: Column array(String colName, String... colNames). Parameters:

String colName. String... colNames.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: array.

1.3.15 array(String colName, scala.collection.Seq<String> colNames)

Creates a new array column. The input columns must all have the same data type. Signature: Column array(String colName, scala.collection.Seq&lt;String&gt; colNames). Parameters:

String colName. scala.collection.Seq<String> colNames.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: array.

1.3.16 array(scala.collection.Seq<Column> cols)

Creates a new array column. The input columns must all have the same data type. Signature: Column array(scala.collection.Seq<Column> cols). Parameter: scala.collection.Seq<Column> cols. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: array.

1.3.17 array_contains(Column column, Object value)

Returns null if the array is null, true if the array contains value, and false otherwise. Signature: Column array_contains(Column column, Object value). Parameters:

Column column. Object value.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array.
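Usage sketch (assuming a dataframe df with a hypothetical array column tags):

import static org.apache.spark.sql.functions.*;

// Keep rows whose tags array contains the literal "spark"
Dataset&lt;Row&gt; sparkRows = df.filter(array_contains(col("tags"), "spark"));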


1.3.18 asc(String columnName)

Returns a sort expression based on ascending order of the column.

df.sort(asc("dept"), desc("age")).

Signature: Column asc(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: sorting.

1.3.19 asc_nulls_first(String columnName)

Returns a sort expression based on ascending order of the column, and null values return before non-null values.

df.sort(asc_nulls_first("dept"), desc("age")).

Signature: Column asc_nulls_first(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting.

1.3.20 asc_nulls_last(String columnName)

Returns a sort expression based on ascending order of the column, and null values appear after non-null values.

df.sort(asc_nulls_last("dept"), desc("age")).

Signature: Column asc_nulls_last(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting.

1.3.21 ascii(Column e)

Computes the numeric value of the first character of the string column, and returns the result as an int column.

Signature: Column ascii(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.


1.3.22 asin(Column e)

Returns the arc sine of a value; the returned angle is in the range -pi/2 through pi/2. Signature: Column asin(Column e). Parameter: Column e. Returns: Column inverse sine of e in radians, as if computed by java.lang.Math.asin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.23 asin(String columnName)

Returns the arc sine of a value; the returned angle is in the range -pi/2 through pi/2. Signature: Column asin(String columnName). Parameter: String columnName. Returns: Column inverse sine of columnName, as if computed by java.lang.Math.asin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.24 atan(Column e)

Returns the arc tangent of a value; the returned angle is in the range -pi/2 through pi/2.

Signature: Column atan(Column e). Parameter: Column e. Returns: Column inverse tangent of e, as if computed by java.lang.Math.atan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.25 atan(String columnName)

Returns the arc tangent of a value; the returned angle is in the range -pi/2 through pi/2.

Signature: Column atan(String columnName). Parameter: String columnName. Returns: Column inverse tangent of columnName, as if computed by java.lang.Math.atan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.26 atan2(Column y, Column x)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(Column y, Column x). Parameters:

Column y coordinate on y-axis. Column x coordinate on x-axis.


Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.27 atan2(Column y, String xName)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(Column y, String xName). Parameters:

Column y coordinate on y-axis. String xName coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.28 atan2(Column y, double xValue)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(Column y, double xValue). Parameters:

Column y coordinate on y-axis. double xValue coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.29 atan2(String yName, Column x)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(String yName, Column x). Parameters:

String yName coordinate on y-axis. Column x coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.


Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.30 atan2(String yName, String xName)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(String yName, String xName). Parameters:

String yName coordinate on y-axis. String xName coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.31 atan2(String yName, double xValue)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(String yName, double xValue). Parameters:

String yName coordinate on y-axis. double xValue coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.32 atan2(double yValue, Column x)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(double yValue, Column x). Parameters:

double yValue coordinate on y-axis. Column x coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.


1.3.33 atan2(double yValue, String xName)

Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature: Column atan2(double yValue, String xName). Parameters:

double yValue coordinate on y-axis. String xName coordinate on x-axis.

Returns: Column the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.34 avg(Column e)

Aggregate function: returns the average of the values in a group. Signature: Column avg(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and statistics.

1.3.35 avg(String columnName)

Aggregate function: returns the average of the values in a group. Signature: Column avg(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and statistics.

1.3.36 base64(Column e)

Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.

Signature: Column base64(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: encoding and digest.

1.3.37 bin(Column e)

An expression that returns the string representation of the binary value of the given long column. For example, bin(“12”) returns “1100”.


Signature: Column bin(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.38 bin(String columnName)

An expression that returns the string representation of the binary value of the given long column. For example, bin(“12”) returns “1100”.

Signature: Column bin(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.39 bitwiseNOT(Column e)

Computes bitwise NOT. Signature: Column bitwiseNOT(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: binary.

1.3.40 broadcast(Dataset<T> df)

Marks a DataFrame as small enough for use in broadcast joins. The following example marks the right DataFrame for broadcast hash join using joinKey.

// left and right are DataFrames
left.join(broadcast(right), "joinKey")

Signature: Dataset<T> broadcast(Dataset<T> df). Parameter: Dataset<T> df. Returns: Dataset<T>. Appeared in Apache Spark v1.5.0. This method is classified in: technical.

1.3.41 bround(Column e)

Returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode.

Signature: Column bround(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: mathematics and rounding.


1.3.42 bround(Column e, int scale)

Rounds the value of e to scale decimal places with HALF_EVEN round mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0.

Signature: Column bround(Column e, int scale). Parameters:

Column e. int scale.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: mathematics and rounding.
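
For instance, a minimal sketch rounding a hypothetical price column to two decimal places with banker's rounding:

import org.apache.spark.sql.functions._
df.select(bround(col("price"), 2)).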

1.3.43 callUDF(String udfName, Column... cols)

Calls a user-defined function. Example:

import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value")).

Signature: Column callUDF(String udfName, Column... cols). Parameters:

String udfName. Column... cols.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: udf.

1.3.44 callUDF(String udfName, scala.collection.Seq<Column> cols)

Calls a user-defined function. Example:

import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", callUDF("simpleUDF", $"value")).

Signature: Column callUDF(String udfName, scala.collection.Seq<Column> cols).

Parameters:

String udfName. scala.collection.Seq<Column> cols.

Returns: Column.


Appeared in Apache Spark v1.5.0. This method is classified in: udf.

1.3.45 cbrt(Column e)

Computes the cube-root of the given value. Signature: Column cbrt(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.46 cbrt(String columnName)

Computes the cube-root of the given column. Signature: Column cbrt(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.47 ceil(Column e)

Computes the ceiling of the given value. Signature: Column ceil(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.48 ceil(String columnName)

Computes the ceiling of the given column. Signature: Column ceil(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.49 coalesce(Column... e)

Returns the first column that is not null, or null if all inputs are null. For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null. Signature: Column coalesce(Column... e). Parameter: Column... e. Returns: Column.


Appeared in Apache Spark v1.3.0. This method is classified in: datashape.
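
For instance, a minimal sketch falling back from a nickname column to a firstName column and finally to a literal (the DataFrame df and its columns are hypothetical):

import org.apache.spark.sql.functions._
df.select(coalesce(col("nickname"), col("firstName"), lit("unknown"))).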

1.3.50 coalesce(scala.collection.Seq<Column> e)

Returns the first column that is not null, or null if all inputs are null. For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null. Signature: Column coalesce(scala.collection.Seq<Column> e). Parameter: scala.collection.Seq<Column> e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape.

1.3.51 col(String colName)

Returns a Column based on the given column name. Signature: Column col(String colName). Parameter: String colName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: technical, navigation, and popular.

1.3.52 collect_list(Column e)

Aggregate function: returns a list of objects with duplicates. Signature: Column collect_list(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and list.

1.3.53 collect_list(String columnName)

Aggregate function: returns a list of objects with duplicates. Signature: Column collect_list(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and list.

1.3.54 collect_set(Column e)

Aggregate function: returns a set of objects with duplicate elements eliminated. Signature: Column collect_set(Column e). Parameter: Column e. Returns: Column.


Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and list.

1.3.55 collect_set(String columnName)

Aggregate function: returns a set of objects with duplicate elements eliminated. Signature: Column collect_set(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and list.
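
A minimal sketch contrasting the two aggregates (df and its columns are hypothetical): collect_list keeps duplicates, collect_set eliminates them:

import org.apache.spark.sql.functions._
df.groupBy("id").agg(collect_list("value"), collect_set("value")).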

1.3.56 column(String colName)

Returns a Column based on the given column name. Alias of col. Signature: Column column(String colName). Parameter: String colName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: technical and navigation.

1.3.57 concat(Column... exprs)

Concatenates multiple input columns together into a single column. If all inputs are binary, concat returns the output as binary. Otherwise, it returns the result as a string.

Signature: Column concat(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and popular.

1.3.58 concat(scala.collection.Seq<Column> exprs)

Concatenates multiple input columns together into a single column. If all inputs are binary, concat returns the output as binary. Otherwise, it returns the result as a string.

Signature: Column concat(scala.collection.Seq<Column> exprs). Parameter: scala.collection.Seq<Column> exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and popular.

1.3.59 concat_ws(String sep, Column... exprs)

Concatenates multiple input string columns together into a single string column, using the given separator.

Signature: Column concat_ws(String sep, Column... exprs).


Parameters:

String sep. Column... exprs.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.
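
For instance, a minimal sketch building a dash-separated date string from three hypothetical columns:

import org.apache.spark.sql.functions._
df.select(concat_ws("-", col("year"), col("month"), col("day"))).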

1.3.60 concat_ws(String sep, scala.collection.Seq<Column> exprs)

Concatenates multiple input string columns together into a single string column, using the given separator.

Signature: Column concat_ws(String sep, scala.collection.Seq<Column> exprs). Parameters:

String sep. scala.collection.Seq<Column> exprs.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.61 conv(Column num, int fromBase, int toBase)

Converts a number in a string column from one base to another. Signature: Column conv(Column num, int fromBase, int toBase). Parameters:

Column num. int fromBase. int toBase.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.

1.3.62 corr(Column column1, Column column2)

Aggregate function: returns the Pearson Correlation Coefficient for two columns. Signature: Column corr(Column column1, Column column2). Parameters:

Column column1. Column column2.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.


1.3.63 corr(String columnName1, String columnName2)

Aggregate function: returns the Pearson Correlation Coefficient for two columns. Signature: Column corr(String columnName1, String columnName2). Parameters:

String columnName1. String columnName2.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.64 cos(Column e)

Returns the trigonometric cosine of an angle. Signature: Column cos(Column e). Parameter: Column e angle in radians. Returns: Column cosine of the angle, as if computed by java.lang.Math.cos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.65 cos(String columnName)

Returns the trigonometric cosine of an angle. Signature: Column cos(String columnName). Parameter: String columnName angle in radians. Returns: Column cosine of the angle, as if computed by java.lang.Math.cos. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.66 cosh(Column e)

Returns the hyperbolic cosine of a double value. Signature: Column cosh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.67 cosh(String columnName)

Returns the hyperbolic cosine of a double value. Signature: Column cosh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.


1.3.68 count(Column e)

Aggregate function: returns the number of items in a group. Signature: Column count(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.69 count(String columnName)

Aggregate function: returns the number of items in a group. Signature: TypedColumn<Object,Object> count(String columnName). Parameter: String columnName. Returns: TypedColumn<Object,Object>. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.70 countDistinct(Column expr, Column... exprs)

Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(Column expr, Column... exprs). Parameters:

Column expr. Column... exprs.

Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.71 countDistinct(Column expr, scala.collection.Seq<Column> exprs)

Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(Column expr, scala.collection.Seq<Column> exprs). Parameters:

Column expr. scala.collection.Seq<Column> exprs.

Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.72 countDistinct(String columnName, String... columnNames)

Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(String columnName, String... columnNames).


Parameters:

String columnName. String... columnNames.

Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.73 countDistinct(String columnName, scala.collection.Seq<String> columnNames)

Aggregate function: returns the number of distinct items in a group. Signature: Column countDistinct(String columnName, scala.collection.Seq<String> columnNames). Parameters:

String columnName. scala.collection.Seq<String> columnNames.

Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.
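
For instance, a minimal sketch counting distinct (country, city) pairs (both column names hypothetical):

import org.apache.spark.sql.functions._
df.agg(countDistinct("country", "city")).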

1.3.74 covar_pop(Column column1, Column column2)

Aggregate function: returns the population covariance for two columns. Signature: Column covar_pop(Column column1, Column column2). Parameters:

Column column1. Column column2.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.75 covar_pop(String columnName1, String columnName2)

Aggregate function: returns the population covariance for two columns. Signature: Column covar_pop(String columnName1, String columnName2). Parameters:

String columnName1. String columnName2.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.


1.3.76 covar_samp(Column column1, Column column2)

Aggregate function: returns the sample covariance for two columns. Signature: Column covar_samp(Column column1, Column column2). Parameters:

Column column1. Column column2.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.77 covar_samp(String columnName1, String columnName2)

Aggregate function: returns the sample covariance for two columns. Signature: Column covar_samp(String columnName1, String columnName2). Parameters:

String columnName1. String columnName2.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate, mathematics, and statistics.

1.3.78 crc32(Column e)

Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.

Signature: Column crc32(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.79 cume_dist()

Window function: returns the cumulative distribution of values within a window partition, that is, the fraction of rows that are below the current row.

N = total number of rows in the partition
cumeDist(x) = number of values before (and including) x / N.

Signature: Column cume_dist(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming and statistics.
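
A minimal sketch over a window partitioned by a hypothetical dept column and ordered by salary:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("dept").orderBy("salary")
df.withColumn("cumeDist", cume_dist().over(w)).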


1.3.80 currentRow()

Window function: returns the special frame boundary that represents the current row in the window partition.

Signature: Column currentRow(). Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in:

1.3.81 current_date()

Returns the current date as a date column. Signature: Column current_date(). Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.82 current_timestamp()

Returns the current timestamp as a timestamp column. Signature: Column current_timestamp(). Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.83 date_add(Column start, int days)

Returns the date that is days days after start. Signature: Column date_add(Column start, int days). Parameters:

Column start. int days.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.84 date_format(Column dateExpr, String format)

Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.

A pattern dd.MM.yyyy would return a string like 18.03.1993. All pattern letters of java.text.SimpleDateFormat can be used.

Signature: Column date_format(Column dateExpr, String format). Parameters:

Column dateExpr. String format.


Returns: Column. Appeared in Apache Spark v1.5.0. Note: Use specialized functions like year whenever possible as they benefit from a specialized implementation. This method is classified in: datetime, string, and conversion.
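
For instance, a minimal sketch rendering a hypothetical birthDate column with the dd.MM.yyyy pattern mentioned above:

import org.apache.spark.sql.functions._
df.select(date_format(col("birthDate"), "dd.MM.yyyy")).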

1.3.85 date_sub(Column start, int days)

Returns the date that is days days before start. Signature: Column date_sub(Column start, int days). Parameters:

Column start. int days.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.86 date_trunc(String format, Column timestamp)

Returns timestamp truncated to the unit specified by the format. Signature: Column date_trunc(String format, Column timestamp). Parameters:

String format ‘year’, ‘yyyy’, ‘yy’ for truncate by year; ‘month’, ‘mon’, ‘mm’ for truncate by month; ‘day’, ‘dd’ for truncate by day. Other options are: ‘second’, ‘minute’, ‘hour’, ‘week’, ‘month’, ‘quarter’. Column timestamp.

Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datetime and string.
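
For instance, a minimal sketch truncating a hypothetical eventTime column to the first instant of its month:

import org.apache.spark.sql.functions._
df.select(date_trunc("month", col("eventTime"))).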

1.3.87 datediff(Column end, Column start)

Returns the number of days from start to end. Signature: Column datediff(Column end, Column start). Parameters:

Column end. Column start.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.


1.3.88 dayofmonth(Column e)

Extracts the day of the month as an integer from a given date/timestamp/string. Signature: Column dayofmonth(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.89 dayofweek(Column e)

Extracts the day of the week as an integer from a given date/timestamp/string. Signature: Column dayofweek(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datetime.

1.3.90 dayofyear(Column e)

Extracts the day of the year as an integer from a given date/timestamp/string. Signature: Column dayofyear(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.91 decode(Column value, String charset)

Decodes the first argument from a binary value into a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null.

Signature: Column decode(Column value, String charset). Parameters:

Column value. String charset.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: encoding.

1.3.92 degrees(Column e)

Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

Signature: Column degrees(Column e). Parameter: Column e angle in radians.


Returns: Column angle in degrees, as if computed by java.lang.Math.toDegrees. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.93 degrees(String columnName)

Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

Signature: Column degrees(String columnName). Parameter: String columnName angle in radians. Returns: Column angle in degrees, as if computed by java.lang.Math.toDegrees. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.94 dense_rank()

Window function: returns the rank of rows within a window partition, without any gaps. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank would give sequential numbers, so the person who came in third place (after the ties) would register as coming in fifth.

This is equivalent to the DENSE_RANK function in SQL. Signature: Column dense_rank(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.
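
A minimal sketch comparing the two ranking functions side by side (the dept and score columns are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("dept").orderBy(desc("score"))
df.withColumn("rank", rank().over(w)).withColumn("denseRank", dense_rank().over(w)).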

1.3.95 desc(String columnName)

Returns a sort expression based on the descending order of the column.

df.sort(asc("dept"), desc("age")).

Signature: Column desc(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: sorting.

1.3.96 desc_nulls_first(String columnName)

Returns a sort expression based on the descending order of the column, and null values appear before non-null values.

df.sort(asc("dept"), desc_nulls_first("age")).

Signature: Column desc_nulls_first(String columnName).


Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting.

1.3.97 desc_nulls_last(String columnName)

Returns a sort expression based on the descending order of the column, and null values appear after non-null values.

df.sort(asc("dept"), desc_nulls_last("age")).

Signature: Column desc_nulls_last(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: sorting.

1.3.98 encode(Column value, String charset)

Encodes the first argument from a string into a binary value using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’). If either argument is null, the result will also be null.

Signature: Column encode(Column value, String charset). Parameters:

Column value. String charset.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: encoding.

1.3.99 exp(Column e)

Computes the exponential of the given value. Signature: Column exp(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.100 exp(String columnName)

Computes the exponential of the given column. Signature: Column exp(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.


1.3.101 explode(Column e)

Creates a new row for each element in the given array or map column. Signature: Column explode(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape.

1.3.102 explode_outer(Column e)

Creates a new row for each element in the given array or map column. Unlike explode, if the array/map is null or empty then null is produced.

Signature: Column explode_outer(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.

1.3.103 expm1(Column e)

Computes the exponential of the given value minus one. Signature: Column expm1(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.104 expm1(String columnName)

Computes the exponential of the given column minus one. Signature: Column expm1(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.105 expr(String expr)

Parses the expression string into the column that it represents, similar to Dataset.selectExpr(java.lang.String...).

// get the number of words of each length
df.groupBy(expr("length(word)")).count().

Signature: Column expr(String expr). Parameter: String expr. Returns: Column. This method is classified in: popular and compute.


1.3.106 factorial(Column e)

Computes the factorial of the given value. Signature: Column factorial(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.107 first(Column e)

Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and navigation.

1.3.108 first(Column e, boolean ignoreNulls)

Aggregate function: returns the first value in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(Column e, boolean ignoreNulls). Parameters:

Column e. boolean ignoreNulls.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate and navigation.

1.3.109 first(String columnName)

Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and navigation.


1.3.110 first(String columnName, boolean ignoreNulls)

Aggregate function: returns the first value of a column in a group. The function by default returns the first values it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column first(String columnName, boolean ignoreNulls). Parameters:

String columnName. boolean ignoreNulls.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate and navigation.
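
For instance, a minimal sketch taking the first non-null email per department (both column names hypothetical):

import org.apache.spark.sql.functions._
df.groupBy("dept").agg(first("email", true)).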

1.3.111 floor(Column e)

Computes the floor of the given value. Signature: Column floor(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.112 floor(String columnName)

Computes the floor of the given column. Signature: Column floor(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and rounding.

1.3.113 format_number(Column x, int d)

Formats numeric column x to a format like ‘#,###,###.##’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.

If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.

Signature: Column format_number(Column x, int d). Parameters:

Column x. int d.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.


1.3.114 format_string(String format, Column... arguments)

Formats the arguments in printf-style and returns the result as a string column. Signature: Column format_string(String format, Column... arguments). Parameters:

String format. Column... arguments.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.

1.3.115 format_string(String format, scala.collection.Seq<Column> arguments)

Formats the arguments in printf-style and returns the result as a string column. Signature: Column format_string(String format, scala.collection.Seq<Column> arguments). Parameters:

String format. scala.collection.Seq<Column> arguments.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and formatting.

1.3.116 from_json(Column e, DataType schema)

Parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, DataType schema). Parameters:

Column e a string column containing JSON data. DataType schema the schema to use when parsing the json string.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.
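
A minimal sketch parsing a hypothetical payload column against an explicit schema (a StructType is the most common DataType here):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = new StructType().add("name", StringType).add("age", IntegerType)
df.select(from_json(col("payload"), schema)).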

1.3.117 from_json(Column e, DataType schema, java.util.Map<String,String> options)

(Java-specific) Parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, DataType schema, java.util.Map<String,String> options).


Parameters:

Column e a string column containing JSON data. DataType schema the schema to use when parsing the json string. java.util.Map<String,String> options options to control how the json is parsed. Accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.

1.3.118 from_json(Column e, DataType schema, scala.collection.immutable.Map<String,String> options)

(Scala-specific) Parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, DataType schema, scala.collection.immutable.Map<String,String> options).

Parameters:

Column e a string column containing JSON data. DataType schema the schema to use when parsing the json string. scala.collection.immutable.Map<String,String> options options to control how the json is parsed. Accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: conversion and json.

1.3.119 from_json(Column e, String schema, java.util.Map<String,String> options)

(Java-specific) Parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, String schema, java.util.Map<String,String> options). Parameters:

Column e a string column containing JSON data. String schema the schema to use when parsing the json string, as a json string. In Spark 2.1, the user-provided schema has to be in JSON format. Since Spark 2.2, the DDL format is also supported for the schema.

java.util.Map<String,String> options.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.


1.3.120 from_json(Column e, String schema, scala.collection.immutable.Map<String,String> options)

(Scala-specific) Parses a column containing a JSON string into a StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, String schema, scala.collection.immutable.Map<String,String> options). Parameters:

Column e a string column containing JSON data. String schema the schema to use when parsing the json string, as a json string; it could be a JSON format string or a DDL-formatted string. scala.collection.immutable.Map<String,String> options.

Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: conversion and json.

1.3.121 from_json(Column e, StructType schema)

Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, StructType schema). Parameters:

Column e a string column containing JSON data. StructType schema the schema to use when parsing the json string.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.122 from_json(Column e, StructType schema, java.util.Map<String,String> options)

(Java-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, StructType schema, java.util.Map<String,String> options).

Parameters:

Column e a string column containing JSON data. StructType schema the schema to use when parsing the json string. java.util.Map<String,String> options options to control how the json is parsed. Accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.


1.3.123 from_json(Column e, StructType schema, scala.collection.immutable.Map<String,String> options)

(Scala-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.

Signature: Column from_json(Column e, StructType schema, scala.collection.immutable.Map<String,String> options).

Parameters:

Column e a string column containing JSON data. StructType schema the schema to use when parsing the json string. scala.collection.immutable.Map<String,String> options options to control how the json is parsed. Accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.124 from_unixtime(Column ut)

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the yyyy-MM-dd HH:mm:ss format.

Signature: Column from_unixtime(Column ut). Parameter: Column ut. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.125 from_unixtime(Column ut, String f)

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

Signature: Column from_unixtime(Column ut, String f). Parameters:

Column ut. String f.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.126 from_utc_timestamp(Column ts, String tz)

Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, ‘GMT+1’ would yield ‘2017-07-14 03:40:00.0’.


Signature: Column from_utc_timestamp(Column ts, String tz). Parameters:

Column ts. String tz.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.127 get_json_object(Column e, String path)

Extracts a json object from a json string based on the specified json path, and returns the json string of the extracted json object. It will return null if the input json string is invalid.

Signature: Column get_json_object(Column e, String path). Parameters:

Column e. String path.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json, conversion, and string.
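
For instance, a minimal sketch extracting a top-level name field from a hypothetical json column:

import org.apache.spark.sql.functions._
df.select(get_json_object(col("json"), "$.name")).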

1.3.128 greatest(Column... exprs)

Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column greatest(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.129 greatest(String columnName, String... columnNames)

Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column greatest(String columnName, String... columnNames). Parameters:

String columnName. String... columnNames.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.


1.3.130 greatest(String columnName, scala.collection.Seq<String> columnNames)

Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column greatest(String columnName, scala.collection.Seq<String> columnNames). Parameters:

String columnName. scala.collection.Seq<String> columnNames.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.131 greatest(scala.collection.Seq<Column> exprs)

Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column greatest(scala.collection.Seq<Column> exprs). Parameter: scala.collection.Seq<Column> exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.132 grouping(Column e)

Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

Signature: Column grouping(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate.

1.3.133 grouping(String columnName)

Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.

Signature: Column grouping(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate.


1.3.134 grouping_id(String colName, scala.collection.Seq<String> colNames)

Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn).

Signature: Column grouping_id(String colName, scala.collection.Seq<String> colNames). Parameters:

String colName. scala.collection.Seq<String> colNames.

Returns: Column. Appeared in Apache Spark v2.0.0. Note: The list of columns should match with grouping columns exactly. This method is classified in: aggregate.

1.3.135 grouping_id(scala.collection.Seq<Column> cols)

Aggregate function: returns the level of grouping, equal to (grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn).

Signature: Column grouping_id(scala.collection.Seq<Column> cols). Parameter: scala.collection.Seq<Column> cols. Returns: Column. Appeared in Apache Spark v2.0.0. Note: The list of columns should match the grouping columns exactly, or be empty (meaning all the grouping columns). This method is classified in: aggregate.
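
A minimal sketch combining grouping_id with a cube over two hypothetical columns, so each output row carries the level of grouping that produced it:

import org.apache.spark.sql.functions._
df.cube("dept", "city").agg(sum("sales"), grouping_id()).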

1.3.136 hash(Column... cols)

Calculates the hash code of given columns, and returns the result as an int column. Signature: Column hash(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: digest.

1.3.137 hash(scala.collection.Seq<Column> cols)

Calculates the hash code of given columns, and returns the result as an int column. Signature: Column hash(scala.collection.Seq<Column> cols). Parameter: scala.collection.Seq<Column> cols. Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: digest.

1.3.138 hex(Column column)

Computes hex value of the given column. Signature: Column hex(Column column).


Parameter: Column column. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.

1.3.139 hour(Column e)

Extracts the hours as an integer from a given date/timestamp/string. Signature: Column hour(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.140 hypot(Column l, Column r)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, Column r). Parameters:

Column l. Column r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.141 hypot(Column l, String rightName)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, String rightName). Parameters:

Column l. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.142 hypot(Column l, double r)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(Column l, double r). Parameters:

Column l. double r.

Returns: Column.


Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.143 hypot(String leftName, Column r)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, Column r). Parameters:

String leftName. Column r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.144 hypot(String leftName, String rightName)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, String rightName). Parameters:

String leftName. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.145 hypot(String leftName, double r)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(String leftName, double r). Parameters:

String leftName. double r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.146 hypot(double l, Column r)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(double l, Column r). Parameters: double l. Column r.

Returns: Column.


Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.147 hypot(double l, String rightName)

Computes sqrt(a^2 + b^2) without intermediate overflow or underflow. Signature: Column hypot(double l, String rightName). Parameters: double l. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.148 initcap(Column e)

Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.

For example, “hello world” will become “Hello World”. Signature: Column initcap(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.149 input_file_name()

Creates a string column for the file name of the current Spark task. Signature: Column input_file_name(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: technical.

1.3.150 instr(Column str, String substring)

Locates the position of the first occurrence of the given substring in the given string column. Returns null if either of the arguments is null.

Signature: Column instr(Column str, String substring). Parameters:

Column str. String substring.

Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based, but a 1-based index. Returns 0 if substr could not be found in str. This method is classified in: string.


1.3.151 isnan(Column e)

Return true iff the column is NaN. Signature: Column isnan(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: validation.

1.3.152 isnull(Column e)

Return true iff the column is null. Signature: Column isnull(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: validation.

1.3.153 json_tuple(Column json, String... fields)

Creates a new row for a json column according to the given field names. Signature: Column json_tuple(Column json, String... fields). Parameters:

Column json. String... fields.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json.

1.3.154 json_tuple(Column json, scala.collection.Seq<String> fields)

Creates a new row for a json column according to the given field names. Signature: Column json_tuple(Column json, scala.collection.Seq<String> fields). Parameters:

Column json. scala.collection.Seq<String> fields.

Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: json.
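
For instance, a minimal sketch pulling two fields out of a hypothetical json column in one pass:

import org.apache.spark.sql.functions._
df.select(json_tuple(col("json"), "name", "age")).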

1.3.155 kurtosis(Column e)

Aggregate function: returns the kurtosis of the values in a group. Signature: Column kurtosis(Column e). Parameter: Column e.


Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.156 kurtosis(String columnName)

Aggregate function: returns the kurtosis of the values in a group. Signature: Column kurtosis(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.157 lag(Column e, int offset)

Window function: returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL. Signature: Column lag(Column e, int offset). Parameters:

Column e. int offset.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.158 lag(Column e, int offset, Object defaultValue)

Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL. Signature: Column lag(Column e, int offset, Object defaultValue). Parameters:

Column e. int offset. Object defaultValue.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.


1.3.159 lag(String columnName, int offset)

Window function: returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL. Signature: Column lag(String columnName, int offset). Parameters:

String columnName. int offset.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.160 lag(String columnName, int offset, Object defaultValue)

Window function: returns the value that is offset rows before the current row, and defaultValue if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

This is equivalent to the LAG function in SQL. Signature: Column lag(String columnName, int offset, Object defaultValue). Parameters:

String columnName. int offset. Object defaultValue.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.
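
A minimal sketch carrying the previous day's close alongside each row (the symbol, date, and close columns are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy("symbol").orderBy("date")
df.withColumn("prevClose", lag("close", 1).over(w)).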

1.3.161 last(Column e)

Aggregate function: returns the last value in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and navigation.


1.3.162 last(Column e, boolean ignoreNulls)

Aggregate function: returns the last value in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(Column e, boolean ignoreNulls). Parameters:

Column e. boolean ignoreNulls.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate and navigation.

1.3.163 last(String columnName)

Aggregate function: returns the last value of the column in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and navigation.

1.3.164 last(String columnName, boolean ignoreNulls)

Aggregate function: returns the last value of the column in a group. The function by default returns the last values it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned. Signature: Column last(String columnName, boolean ignoreNulls). Parameters:

String columnName. boolean ignoreNulls.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: aggregate and navigation.

1.3.165 last_day(Column e)

Given a date column, returns the last day of the month to which the given date belongs. For example, input “2015-07-27” returns “2015-07-31”, since July 31 is the last day of July 2015.

Signature: Column last_day(Column e). Parameter: Column e. Returns: Column.


Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.166 lead(Column e, int offset)

Window function: returns the value that is offset rows after the current row, and null if there is less than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL. Signature: Column lead(Column e, int offset). Parameters:

Column e. int offset.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.167 lead(Column e, int offset, Object defaultValue)

Window function: returns the value that is offset rows after the current row, and defaultValue if there is less than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL. Signature: Column lead(Column e, int offset, Object defaultValue). Parameters:

Column e. int offset. Object defaultValue.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.168 lead(String columnName, int offset)

Window function: returns the value that is offset rows after the current row, and null if there is less than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL. Signature: Column lead(String columnName, int offset). Parameters:

String columnName. int offset.

Returns: Column.


Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.169 lead(String columnName, int offset, Object defaultValue)

Window function: returns the value that is offset rows after the current row, and defaultValue if there is less than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.

This is equivalent to the LEAD function in SQL. Signature: Column lead(String columnName, int offset, Object defaultValue). Parameters:

String columnName. int offset. Object defaultValue.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.170 least(Column... exprs)

Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column least(Column... exprs). Parameter: Column... exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.
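
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with numeric columns q1, q2, and q3:

import static org.apache.spark.sql.functions.least;
// Row-wise minimum across the three columns, skipping nulls.
df.withColumn("min_quarter",
    least(df.col("q1"), df.col("q2"), df.col("q3"))).show();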

1.3.171 least(String columnName, String... columnNames)

Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column least(String columnName, String... columnNames). Parameters:

String columnName. String... columnNames.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.172 least(String columnName, scala.collection.Seq<String> columnNames)

Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column least(String columnName, scala.collection.Seq<String> columnNames).

Parameters:

String columnName. scala.collection.Seq<String> columnNames.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.173 least(scala.collection.Seq<Column> exprs)

Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature: Column least(scala.collection.Seq<Column> exprs). Parameter: scala.collection.Seq<Column> exprs. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: comparison and sorting.

1.3.174 length(Column e)

Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros.

Signature: Column length(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.175 levenshtein(Column l, Column r)

Computes the Levenshtein distance of the two given string columns. Signature: Column levenshtein(Column l, Column r). Parameters:

Column l. Column r.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.176 lit(Object literal)

Creates a Column of literal value. The passed-in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.

Signature: Column lit(Object literal). Parameter: Object literal. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: datashape and popular.
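
A minimal Java sketch on a hypothetical Dataset<Row> df; the constant is repeated on every row:

import static org.apache.spark.sql.functions.lit;
// Add a constant column "source" with the value "catalog" on every row.
df.withColumn("source", lit("catalog")).show();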

1.3.177 locate(String substr, Column str)

Locate the position of the first occurrence of substr. Signature: Column locate(String substr, Column str). Parameters:

String substr. Column str.

Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based, but a 1-based index; returns 0 if substr could not be found in str. This method is classified in: string.

1.3.178 locate(String substr, Column str, int pos)

Locate the position of the first occurrence of substr in a string column, after position pos. Signature: Column locate(String substr, Column str, int pos). Parameters:

String substr. Column str. int pos.

Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based, but a 1-based index; returns 0 if substr could not be found in str. This method is classified in: string.
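
To make the 1-based indexing concrete, here is a small Java sketch using literal values (the hypothetical df only supplies rows):

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.locate;
// In "banana", the first "a" is at position 2 (1-based);
// searching from position 3 finds the "a" at position 4.
df.select(locate("a", lit("banana")).alias("first_a"),
          locate("a", lit("banana"), 3).alias("from_pos_3")).show();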

1.3.179 log(Column e)

Computes the natural logarithm of the given value. Signature: Column log(Column e). Parameter: Column e. Returns: Column.

Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.180 log(String columnName)

Computes the natural logarithm of the given column. Signature: Column log(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.181 log(double base, Column a)

Returns the first argument-base logarithm of the second argument. Signature: Column log(double base, Column a). Parameters:

double base. Column a.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.
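
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a numeric column bytes:

import static org.apache.spark.sql.functions.log;
// Base-2 logarithm of each value; for a value of 8 this yields 3.0.
df.select(log(2.0, df.col("bytes")).alias("log2_bytes")).show();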

1.3.182 log(double base, String columnName)

Returns the first argument-base logarithm of the second argument. Signature: Column log(double base, String columnName). Parameters:

double base. String columnName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.183 log10(Column e)

Computes the logarithm of the given value in base 10. Signature: Column log10(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.184 log10(String columnName)

Computes the logarithm of the given value in base 10. Signature: Column log10(String columnName).

Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.185 log1p(Column e)

Computes the natural logarithm of the given value plus one. Signature: Column log1p(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.186 log1p(String columnName)

Computes the natural logarithm of the given column plus one. Signature: Column log1p(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.187 log2(Column expr)

Computes the logarithm of the given column in base 2. Signature: Column log2(Column expr). Parameter: Column expr. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.188 log2(String columnName)

Computes the logarithm of the given value in base 2. Signature: Column log2(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.189 lower(Column e)

Converts a string column to lower case. Signature: Column lower(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: string.

1.3.190 lpad(Column str, int len, String pad)

Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.

Signature: Column lpad(Column str, int len, String pad). Parameters:

Column str. int len. String pad.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.
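
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a string column id:

import static org.apache.spark.sql.functions.lpad;
// Zero-pad ids to 8 characters: "42" becomes "00000042".
df.select(lpad(df.col("id"), 8, "0").alias("padded_id")).show();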

1.3.191 ltrim(Column e)

Trim the spaces from left end for the specified string value. Signature: Column ltrim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.192 ltrim(Column e, String trimString)

Trim the specified character string from left end for the specified string column. Signature: Column ltrim(Column e, String trimString). Parameters:

Column e. String trimString.

Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.

1.3.193 map(Column... cols)

Creates a new map column. The input columns must be grouped as key-value pairs, for example: (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can’t be null. The value columns must all have the same data type.

Signature: Column map(Column... cols). Parameter: Column... cols. Returns: Column. Appeared in Apache Spark v2.0. This method is classified in: datashape.
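
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with string columns firstName and lastName; the arguments alternate key, value, key, value:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.map;
// Build a map column with two entries per row.
df.select(map(lit("first"), df.col("firstName"),
              lit("last"), df.col("lastName")).alias("name_map")).show();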

1.3.194 map(scala.collection.Seq<Column> cols)

Creates a new map column. The input columns must be grouped as key-value pairs, for example: (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can’t be null. The value columns must all have the same data type.

Signature: Column map(scala.collection.Seq<Column> cols). Parameter: scala.collection.Seq<Column> cols. Returns: Column. Appeared in Apache Spark v2.0. This method is classified in: datashape.

1.3.195 map_keys(Column e)

Returns an unordered array containing the keys of the map. Signature: Column map_keys(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datashape.

1.3.196 map_values(Column e)

Returns an unordered array containing the values of the map. Signature: Column map_values(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: datashape.

1.3.197 max(Column e)

Aggregate function: returns the maximum value of the expression in a group. Signature: Column max(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.198 max(String columnName)

Aggregate function: returns the maximum value of the column in a group. Signature: Column max(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.199 md5(Column e)

Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

Signature: Column md5(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.200 mean(Column e)

Aggregate function: returns the average of the values in a group. Alias for avg. Signature: Column mean(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: aggregate and statistics.

1.3.201 mean(String columnName)

Aggregate function: returns the average of the values in a group. Alias for avg. Signature: Column mean(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: aggregate and statistics.

1.3.202 min(Column e)

Aggregate function: returns the minimum value of the expression in a group. Signature: Column min(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.203 min(String columnName)

Aggregate function: returns the minimum value of the column in a group. Signature: Column min(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate and sorting.

1.3.204 minute(Column e)

Extracts the minutes as an integer from a given date/timestamp/string. Signature: Column minute(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.205 monotonicallyIncreasingId()

Deprecated. Use monotonically_increasing_id(). Since 2.0.0. A column expression that generates monotonically increasing 64-bit integers.

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.

Signature: Column monotonicallyIncreasingId(). Returns: Column. Appeared in Apache Spark v1.4.0. Function has been deprecated in Spark v2.0.0 and is replaced by monotonically_increasing_id().

This method is classified in: deprecated.

1.3.206 monotonically_increasing_id()

A column expression that generates monotonically increasing 64-bit integers. The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has less than 1 billion partitions, and each partition has less than 8 billion records.

As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs:

0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.

Signature: Column monotonically_increasing_id(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: datashape.
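
A minimal Java sketch on a hypothetical Dataset<Row> df; remember that the generated ids are unique but not consecutive:

import static org.apache.spark.sql.functions.monotonically_increasing_id;
// Attach a unique 64-bit id to every row.
df.withColumn("row_id", monotonically_increasing_id()).show();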

1.3.207 month(Column e)

Extracts the month as an integer from a given date/timestamp/string. Signature: Column month(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.208 months_between(Column date1, Column date2)

Returns number of months between dates date1 and date2. Signature: Column months_between(Column date1, Column date2). Parameters:

Column date1. Column date2.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.209 nanvl(Column col1, Column col2)

Returns col1 if it is not NaN, or col2 if col1 is NaN. Both inputs should be floating point columns (DoubleType or FloatType). Signature: Column nanvl(Column col1, Column col2). Parameters:

Column col1. Column col2.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conditional.

1.3.210 negate(Column e)

Unary minus, that is: negate the expression.

// Select the amount column and negate all values.
// Scala:
df.select( -df("amount") )
// Java:
df.select( negate(df.col("amount")) );

Signature: Column negate(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and arithmetic.

1.3.211 next_day(Column date, String dayOfWeek)

Given a date column, returns the first date which is later than the value of the date column that is on the specified day of the week.

For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.

Day of the week parameter is case insensitive, and accepts: “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, “Sun”.

Signature: Column next_day(Column date, String dayOfWeek). Parameters:

Column date. String dayOfWeek.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
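
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a date column d:

import static org.apache.spark.sql.functions.next_day;
// First Sunday strictly after each date in column "d".
df.select(df.col("d"), next_day(df.col("d"), "Sunday").alias("next_sunday")).show();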

1.3.212 not(Column e)

Inversion of boolean expression, that is: NOT.

// Scala: select rows that are not active (isActive === false)
df.filter( !df("isActive") )
// Java:
df.filter( not(df.col("isActive")) );

Signature: Column not(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: binary.

1.3.213 ntile(int n)

Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.

This is equivalent to the NTILE function in SQL. Signature: Column ntile(int n). Parameter: int n. Returns: Column.

Appeared in Apache Spark v1.4.0. This method is classified in: streaming.
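
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a numeric column score:

import static org.apache.spark.sql.functions.ntile;
import org.apache.spark.sql.expressions.Window;
// Assign each row to one of four quartiles, ordered by score.
df.withColumn("quartile", ntile(4).over(Window.orderBy("score"))).show();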

1.3.214 percent_rank()

Window function: returns the relative rank (that is: percentile) of rows within a window partition.

This is computed by:

(rank of row in its partition - 1) / (number of rows in the partition - 1)

This is equivalent to the PERCENT_RANK function in SQL. Signature: Column percent_rank(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.

1.3.215 pmod(Column dividend, Column divisor)

Returns the positive value of dividend mod divisor. Signature: Column pmod(Column dividend, Column divisor). Parameters:

Column dividend. Column divisor.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.
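
The difference from the plain % operator is easiest to see with literals; a minimal Java sketch on a hypothetical df:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.pmod;
// pmod(-7, 3) yields 2, whereas Java's -7 % 3 yields -1.
df.select(pmod(lit(-7), lit(3)).alias("pm")).show();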

1.3.216 posexplode(Column e)

Creates a new row for each element with position in the given array or map column. Signature: Column posexplode(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: datashape.
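
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a column order_id and an array column items:

import static org.apache.spark.sql.functions.posexplode;
// One output row per array element, with its 0-based position in
// a column named "pos" and the element in a column named "col".
df.select(df.col("order_id"), posexplode(df.col("items"))).show();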

1.3.217 posexplode_outer(Column e)

Creates a new row for each element with position in the given array or map column. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.

Signature: Column posexplode_outer(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.

1.3.218 pow(Column l, Column r)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, Column r). Parameters:

Column l. Column r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.219 pow(Column l, String rightName)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, String rightName). Parameters:

Column l. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.220 pow(Column l, double r)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(Column l, double r). Parameters:

Column l. double r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.221 pow(String leftName, Column r)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, Column r). Parameters:

String leftName. Column r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.222 pow(String leftName, String rightName)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, String rightName). Parameters:

String leftName. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.223 pow(String leftName, double r)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(String leftName, double r). Parameters:

String leftName. double r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.224 pow(double l, Column r)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(double l, Column r). Parameters:

double l. Column r.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.225 pow(double l, String rightName)

Returns the value of the first argument raised to the power of the second argument. Signature: Column pow(double l, String rightName). Parameters:

double l. String rightName.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and arithmetic.

1.3.226 quarter(Column e)

Extracts the quarter as an integer from a given date/timestamp/string. Signature: Column quarter(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.227 radians(Column e)

Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

Signature: Column radians(Column e). Parameter: Column e angle in degrees. Returns: Column angle in radians, as if computed by java.lang.Math.toRadians. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.228 radians(String columnName)

Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

Signature: Column radians(String columnName). Parameter: String columnName angle in degrees. Returns: Column angle in radians, as if computed by java.lang.Math.toRadians. Appeared in Apache Spark v2.1.0. This method is classified in: mathematics and trigonometry.

1.3.229 rand()

Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].

Signature: Column rand(). Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.230 rand(long seed)

Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].

Signature: Column rand(long seed). Parameter: long seed. Returns: Column. Appeared in Apache Spark v1.4.0. Note: This is indeterministic when data partitions are not fixed. This method is classified in: mathematics.

1.3.231 randn()

Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

Signature: Column randn(). Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.232 randn(long seed)

Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

Signature: Column randn(long seed). Parameter: long seed. Returns: Column. Appeared in Apache Spark v1.4.0. Note: This is indeterministic when data partitions are not fixed. This method is classified in: mathematics.

1.3.233 rank()

Window function: returns the rank of rows within a window partition. The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. Rank would instead give sequential numbers, so the person ranked just after the ties would register as coming in fifth.

This is equivalent to the RANK function in SQL. Signature: Column rank(). Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: streaming.

1.3.234 regexp_extract(Column e, String exp, int groupIdx)

Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.

Signature: Column regexp_extract(Column e, String exp, int groupIdx). Parameters:

Column e. String exp. int groupIdx.

Returns: Column.

Appeared in Apache Spark v1.5.0. This method is classified in: string.
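
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a string column ref holding values like "order-1234":

import static org.apache.spark.sql.functions.regexp_extract;
// Group 1 is the first capture group: the digits after "order-".
df.select(regexp_extract(df.col("ref"), "order-(\\d+)", 1).alias("order_no")).show();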

1.3.235 regexp_replace(Column e, Column pattern, Column replacement)

Replace all substrings of the specified string value that match regexp with rep. Signature: Column regexp_replace(Column e, Column pattern, Column replacement). Parameters:

Column e. Column pattern. Column replacement.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: string.

1.3.236 regexp_replace(Column e, String pattern, String replacement)

Replace all substrings of the specified string value that match regexp with rep. Signature: Column regexp_replace(Column e, String pattern, String replacement). Parameters:

Column e. String pattern. String replacement.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.237 repeat(Column str, int n)

Repeats a string column n times, and returns it as a new string column. Signature: Column repeat(Column str, int n). Parameters:

Column str. int n.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.238 reverse(Column str)

Reverses the string column and returns it as a new string column. Signature: Column reverse(Column str).

Parameter: Column str. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string and array.

1.3.239 rint(Column e)

Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Signature: Column rint(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: rounding and mathematics.

1.3.240 rint(String columnName)

Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Signature: Column rint(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: rounding and mathematics.

1.3.241 round(Column e)

Returns the value of the column e rounded to 0 decimal places with HALF_UP round mode.

Signature: Column round(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: rounding and mathematics.

1.3.242 round(Column e, int scale)

Round the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0, or round at the integral part when scale is less than 0.

Signature: Column round(Column e, int scale). Parameters:

Column e. int scale.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: rounding and mathematics.
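
A minimal Java sketch on a hypothetical df, showing both a positive and a negative scale:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.round;
// Scale 2 keeps two decimals (3.14159 -> 3.14);
// scale -2 rounds at the integral part (1234 -> 1200).
df.select(round(lit(3.14159), 2), round(lit(1234), -2)).show();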

1.3.243 row_number()

Window function: returns a sequential number starting at 1 within a window partition. Signature: Column row_number(). Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: streaming.

1.3.244 rpad(Column str, int len, String pad)

Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.

Signature: Column rpad(Column str, int len, String pad). Parameters:

Column str. int len. String pad.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.245 rtrim(Column e)

Trim the spaces from right end for the specified string value. Signature: Column rtrim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.246 rtrim(Column e, String trimString)

Trim the specified character string from right end for the specified string column. Signature: Column rtrim(Column e, String trimString). Parameters:

Column e. String trimString.

Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.

1.3.247 second(Column e)

Extracts the seconds as an integer from a given date/timestamp/string. Signature: Column second(Column e).

Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.248 sha1(Column e)

Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.

Signature: Column sha1(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.

1.3.249 sha2(Column e, int numBits)

Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.

Signature: Column sha2(Column e, int numBits). Parameters:

Column e column to compute SHA-2 on. int numBits one of 224, 256, 384, or 512.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.
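
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a string or binary column payload:

import static org.apache.spark.sql.functions.sha2;
// SHA-256 digest, returned as a 64-character hex string.
df.select(sha2(df.col("payload"), 256).alias("digest")).show();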

1.3.250 shiftLeft(Column e, int numBits)

Shift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value.

Signature: Column shiftLeft(Column e, int numBits). Parameters:

Column e. int numBits.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.251 shiftRight(Column e, int numBits)

(Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.

Signature: Column shiftRight(Column e, int numBits).

Parameters:

Column e. int numBits.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.252 shiftRightUnsigned(Column e, int numBits)

Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.

Signature: Column shiftRightUnsigned(Column e, int numBits). Parameters:

Column e. int numBits.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: binary.

1.3.253 signum(Column e)

Computes the signum of the given value. Signature: Column signum(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.254 signum(String columnName)

Computes the signum of the given column. Signature: Column signum(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics.

1.3.255 sin(Column e)

Computes the sine of an angle. Signature: Column sin(Column e). Parameter: Column e angle in radians. Returns: Column sine of the angle, as if computed by java.lang.Math.sin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.256 sin(String columnName)

Computes the sine of an angle. Signature: Column sin(String columnName). Parameter: String columnName angle in radians. Returns: Column sine of the angle, as if computed by java.lang.Math.sin. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.257 sinh(Column e)

Computes the hyperbolic sine of the given value. Signature: Column sinh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic sine of the given value, as if computed by java.lang.Math.sinh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.258 sinh(String columnName)

Computes the hyperbolic sine of the given column. Signature: Column sinh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic sine of the given value, as if computed by java.lang.Math.sinh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.259 size(Column e)

Returns length of array or map. Signature: Column size(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array.

1.3.260 skewness(Column e)

Aggregate function: returns the skewness of the values in a group. Signature: Column skewness(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.261 skewness(String columnName)

Aggregate function: returns the skewness of the values in a group. Signature: Column skewness(String columnName).

Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate.

1.3.262 sort_array(Column e)

Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements.

Signature: Column sort_array(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array and sorting.

1.3.263 sort_array(Column e, boolean asc)

Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements.

Signature: Column sort_array(Column e, boolean asc). Parameters:

Column e. boolean asc.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: array and sorting.

1.3.264 soundex(Column e)

Returns the soundex code for the specified expression. Signature: Column soundex(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.265 spark_partition_id()

Partition ID. Signature: Column spark_partition_id(). Returns: Column. Appeared in Apache Spark v1.6.0. Note: This is indeterministic because it depends on data partitioning and task scheduling. This method is classified in: technical.

1.3.266 split(Column str, String pattern)

Splits str around pattern (pattern is a regular expression). Signature: Column split(Column str, String pattern). Parameters:

Column str. String pattern.

Returns: Column. Appeared in Apache Spark v1.5.0. Note: Pattern is a string representation of the regular expression. This method is classified in: popular and string.
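
Because the pattern is a regular expression, characters like the dot must be escaped; a minimal Java sketch on a hypothetical df with a string column hostname:

import static org.apache.spark.sql.functions.split;
// "www.example.com" becomes ["www", "example", "com"].
df.select(split(df.col("hostname"), "\\.").alias("labels")).show();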

1.3.267 sqrt(Column e)

Computes the square root of the specified float value. Signature: Column sqrt(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: mathematics and arithmetic.

1.3.268 sqrt(String colName)

Computes the square root of the specified float value. Signature: Column sqrt(String colName). Parameter: String colName. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: mathematics and arithmetic.

1.3.269 stddev(Column e)

Aggregate function: alias for stddev_samp. Signature: Column stddev(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.270 stddev(String columnName)

Aggregate function: alias for stddev_samp. Signature: Column stddev(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.271 stddev_pop(Column e)

Aggregate function: returns the population standard deviation of the expression in a group.

Signature: Column stddev_pop(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.272 stddev_pop(String columnName)

Aggregate function: returns the population standard deviation of the expression in a group.

Signature: Column stddev_pop(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.273 stddev_samp(Column e)

Aggregate function: returns the sample standard deviation of the expression in a group. Signature: Column stddev_samp(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.274 stddev_samp(String columnName)

Aggregate function: returns the sample standard deviation of the expression in a group. Signature: Column stddev_samp(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.275 struct(Column... cols)

Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (that is: aliased), its name would be retained as the StructField's name, otherwise, the newly generated StructField's name would be auto generated as col with a suffix index + 1, that is: col1, col2, col3, ...

Signature: Column struct(Column... cols). Parameter: Column... cols. Returns: Column.

Appeared in Apache Spark v1.4.0. This method is classified in: datashape.
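
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with columns city and zip:

import static org.apache.spark.sql.functions.struct;
// Nest the two columns into one struct column; the input column
// names are kept as the struct's field names.
df.select(struct(df.col("city"), df.col("zip")).alias("address")).show();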

1.3.276 struct(String colName, String... colNames)

Creates a new struct column that composes multiple input columns. Signature: Column struct(String colName, String... colNames). Parameters:

String colName. String... colNames.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.277 struct(String colName, scala.collection.Seq<String> colNames)

Creates a new struct column that composes multiple input columns. Signature: Column struct(String colName, scala.collection.Seq<String> colNames). Parameters:

String colName. scala.collection.Seq<String> colNames.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.278 struct(scala.collection.Seq<Column> cols)

Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (that is: aliased), its name would be retained as the StructField's name, otherwise, the newly generated StructField's name would be auto generated as col with a suffix index + 1, that is: col1, col2, col3, ...

Signature: Column struct(scala.collection.Seq<Column> cols). Parameter: scala.collection.Seq<Column> cols. Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: datashape.

1.3.279 substring(Column str, int pos, int len)

Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

Signature: Column substring(Column str, int pos, int len).

Parameters:

Column str. int pos. int len.

Returns: Column. Appeared in Apache Spark v1.5.0. Note: The position is not zero-based, but a 1-based index. This method is classified in: string.

1.3.280 substring_index(Column str, String delim, int count)

Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.

Signature: Column substring_index(Column str, String delim, int count). Parameters:

Column str. String delim. int count.

Returns: Column. This method is classified in: string.
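
The sign of count is easiest to see with a literal; a minimal Java sketch on a hypothetical df:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.substring_index;
// Positive count keeps the left side: ("a.b.c", ".", 2) -> "a.b".
// Negative count keeps the right side: ("a.b.c", ".", -1) -> "c".
df.select(substring_index(lit("a.b.c"), ".", 2).alias("left2"),
          substring_index(lit("a.b.c"), ".", -1).alias("right1")).show();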

1.3.281 sum(Column e)

Aggregate function: returns the sum of all values in the expression. Signature: Column sum(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.282 sum(String columnName)

Aggregate function: returns the sum of all values in the given column. Signature: Column sum(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.283 sumDistinct(Column e)

Aggregate function: returns the sum of distinct values in the expression. Signature: Column sumDistinct(Column e).

Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.284 sumDistinct(String columnName)

Aggregate function: returns the sum of distinct values in the expression. Signature: Column sumDistinct(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: aggregate.

1.3.285 tan(Column e)

Computes the tangent of an angle. Signature: Column tan(Column e). Parameter: Column e angle in radians. Returns: Column tangent of the given value, as if computed by java.lang.Math.tan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.286 tan(String columnName)

Computes the tangent of an angle. Signature: Column tan(String columnName). Parameter: String columnName angle in radians. Returns: Column tangent of the given value, as if computed by java.lang.Math.tan. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.287 tanh(Column e)

Computes the hyperbolic tangent of the given value. Signature: Column tanh(Column e). Parameter: Column e hyperbolic angle. Returns: Column hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.288 tanh(String columnName)

Computes the hyperbolic tangent of the given column. Signature: Column tanh(String columnName). Parameter: String columnName hyperbolic angle. Returns: Column hyperbolic tangent of the given value, as if computed by java.lang.Math.tanh. Appeared in Apache Spark v1.4.0. This method is classified in: mathematics and trigonometry.

1.3.289 toDegrees(Column e)

Deprecated. Use degrees. Since 2.1.0. Signature: Column toDegrees(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. Function has been deprecated in Spark v2.1.0 and is replaced by degrees. This method is classified in: deprecated.

1.3.290 toDegrees(String columnName)

Deprecated. Use degrees. Since 2.1.0. Signature: Column toDegrees(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. Function has been deprecated in Spark v2.1.0 and is replaced by degrees. This method is classified in: deprecated.

1.3.291 toRadians(Column e)

Deprecated. Use radians. Since 2.1.0. Signature: Column toRadians(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.4.0. Function has been deprecated in Spark v2.1.0 and is replaced by radians. This method is classified in: deprecated.

1.3.292 toRadians(String columnName)

Deprecated. Use radians. Since 2.1.0. Signature: Column toRadians(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.4.0. Function has been deprecated in Spark v2.1.0 and is replaced by radians. This method is classified in: deprecated.

1.3.293 to_date(Column e)

Converts the column into DateType by casting rules to DateType. Signature: Column to_date(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime, conversion, and popular.

1.3.294 to_date(Column e, String fmt)

Converts the column into a DateType with a specified format (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html); returns null if it fails.

Signature: Column to_date(Column e, String fmt). Parameters:

Column e. String fmt.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datetime, conversion, and popular.
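
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a string column s holding dates such as "27/07/2015":

import static org.apache.spark.sql.functions.to_date;
// Parse with an explicit SimpleDateFormat-style pattern;
// unparseable values become null.
df.select(to_date(df.col("s"), "dd/MM/yyyy").alias("d")).show();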

1.3.295 to_json(Column e)

Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

Signature: Column to_json(Column e). Parameter: Column e a column containing a struct or array of the structs. Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.296 to_json(Column e, java.util.Map<String,String> options)

(Java-specific) Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

Signature: Column to_json(Column e, java.util.Map<String,String> options). Parameters:

Column e a column containing a struct or array of the structs. java.util.Map<String,String> options: options to control how the struct column is converted into a json string; accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.297 to_json(Column e, scala.collection.immutable.Map<String,String> options)

(Scala-specific) Converts a column containing a StructType, ArrayType of StructTypes, a MapType or ArrayType of MapTypes into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.

Signature: Column to_json(Column e, scala.collection.immutable.Map<String,String> options).

Parameters:

Column e a column containing a struct or array of the structs. scala.collection.immutable.Map<String,String> options: options to control how the struct column is converted into a json string; accepts the same options as the json data source.

Returns: Column. Appeared in Apache Spark v2.1.0. This method is classified in: conversion and json.

1.3.298 to_timestamp(Column s)

Convert time string to a Unix timestamp (in seconds) by casting rules to TimestampType.

Signature: Column to_timestamp(Column s). Parameter: Column s. Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datetime and conversion.

1.3.299 to_timestamp(Column s, String fmt)

Convert time string with a specified format (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to a Unix timestamp (in seconds); returns null if it fails.

Signature: Column to_timestamp(Column s, String fmt). Parameters:

Column s. String fmt.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datetime and conversion.

1.3.300 to_utc_timestamp(Column ts, String tz)

Given a timestamp like ‘2017-07-14 02:40:00.0’, interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, ‘GMT+1’ would yield ‘2017-07-14 01:40:00.0’.

Signature: Column to_utc_timestamp(Column ts, String tz). Parameters:

Column ts. String tz.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime and conversion.

1.3.301 translate(Column src, String matchingString, String replaceString)

Translate any character in the src by a character in replaceString. The characters in replaceString correspond to the characters in matchingString. The translate will happen when any character in the string matches the character in the matchingString.

Signature: Column translate(Column src, String matchingString, String replaceString).

Parameters:

Column src. String matchingString. String replaceString.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.302 trim(Column e)

Trim the spaces from both ends for the specified string column. Signature: Column trim(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: string.

1.3.303 trim(Column e, String trimString)

Trim the specified character from both ends for the specified string column. Signature: Column trim(Column e, String trimString). Parameters:

Column e. String trimString.

Returns: Column. Appeared in Apache Spark v2.3.0. This method is classified in: string.

1.3.304 trunc(Column date, String format)

Returns date truncated to the unit specified by the format. Signature: Column trunc(Column date, String format). Parameters:

Column date. String format: ‘year’, ‘yyyy’, ‘yy’ for truncate by year, or ‘month’, ‘mon’, ‘mm’ for truncate by month.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
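
A minimal Java sketch, assuming a hypothetical Dataset<Row> df with a date column d:

import static org.apache.spark.sql.functions.trunc;
// "2015-07-27" truncated by month becomes "2015-07-01".
df.select(trunc(df.col("d"), "month").alias("month_start")).show();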

1.3.305 typedLit(T literal, scala.reflect.api.TypeTags.TypeTag<T> evidence$1)

Creates a Column of literal value. The passed-in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized Scala types, for example: List, Seq, and Map.

Signature: Column typedLit(T literal, scala.reflect.api.TypeTags.TypeTag<T> evidence$1).

Parameters:

T literal. scala.reflect.api.TypeTags.TypeTag<T> evidence$1.

Returns: Column. Appeared in Apache Spark v2.2.0. This method is classified in: datashape.

1.3.306 udf(Object f, DataType dataType)

Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(Object f, DataType dataType). Parameters:

Object f A closure in Scala. DataType dataType The output data type of the UDF.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.0.0. This method is classified in: udf.

1.3.307 udf(UDF0<?> f, DataType returnType)

Defines a Java UDF0 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF0<?> f, DataType returnType).

Parameters:

UDF0<?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.308 udf(UDF10<?,?,?,?,?,?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF10 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF10<?,?,?,?,?,?,?,?,?,?,?> f, DataType returnType).

Parameters:

UDF10<?,?,?,?,?,?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.309 udf(UDF1<?,?> f, DataType returnType)

Defines a Java UDF1 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF1<?,?> f, DataType returnType). Parameters:

UDF1<?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.
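
The UDF0 through UDF10 variants target Java callers; a minimal sketch of the UDF1 case, written here in Scala by implementing the interface (df and its column qty are assumed):

import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.IntegerType

val df = ... // schema => qty: IntegerType
// The interface's single call method supplies the logic; the return type is explicit
val plusOne = udf(new UDF1[Integer, Integer] {
  override def call(x: Integer): Integer = x + 1
}, IntegerType)
df.select(plusOne(col("qty")))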

1.3.310 udf(UDF2<?,?,?> f, DataType returnType)

Defines a Java UDF2 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF2<?,?,?> f, DataType returnType).


Parameters:

UDF2<?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.311 udf(UDF3<?,?,?,?> f, DataType returnType)

Defines a Java UDF3 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF3<?,?,?,?> f, DataType returnType). Parameters:

UDF3<?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.312 udf(UDF4<?,?,?,?,?> f, DataType returnType)

Defines a Java UDF4 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF4<?,?,?,?,?> f, DataType returnType). Parameters:

UDF4<?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.313 udf(UDF5<?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF5 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF5<?,?,?,?,?,?> f, DataType returnType).


Parameters:

UDF5<?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.314 udf(UDF6<?,?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF6 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF6<?,?,?,?,?,?,?> f, DataType returnType). Parameters:

UDF6<?,?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.315 udf(UDF7<?,?,?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF7 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF7<?,?,?,?,?,?,?,?> f, DataType returnType). Parameters:

UDF7<?,?,?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.316 udf(UDF8<?,?,?,?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF8 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().


Signature: UserDefinedFunction udf(UDF8<?,?,?,?,?,?,?,?,?> f, DataType returnType).

Parameters:

UDF8<?,?,?,?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.317 udf(UDF9<?,?,?,?,?,?,?,?,?,?> f, DataType returnType)

Defines a Java UDF9 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(UDF9<?,?,?,?,?,?,?,?,?,?> f, DataType returnType).

Parameters:

UDF9<?,?,?,?,?,?,?,?,?,?> f. DataType returnType.

Returns: UserDefinedFunction. Appeared in Apache Spark v2.3.0. This method is classified in: udf.

1.3.318 udf(scala.Function0<RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$2)

Defines a Scala closure of 0 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function0<RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$2).

Parameters:

scala.Function0<RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$2.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.


1.3.319 udf(scala.Function10<A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$57, scala.reflect.api.TypeTags.TypeTag<A1> evidence$58, scala.reflect.api.TypeTags.TypeTag<A2> evidence$59, scala.reflect.api.TypeTags.TypeTag<A3> evidence$60, scala.reflect.api.TypeTags.TypeTag<A4> evidence$61, scala.reflect.api.TypeTags.TypeTag<A5> evidence$62, scala.reflect.api.TypeTags.TypeTag<A6> evidence$63, scala.reflect.api.TypeTags.TypeTag<A7> evidence$64, scala.reflect.api.TypeTags.TypeTag<A8> evidence$65, scala.reflect.api.TypeTags.TypeTag<A9> evidence$66, scala.reflect.api.TypeTags.TypeTag<A10> evidence$67)

Defines a Scala closure of 10 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function10<A1,A2,A3,A4,A5,A6, A7,A8,A9,A10,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$57,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$58,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$59,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$60,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$61,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$62,

scala.reflect.api.TypeTags.TypeTag<A6> evidence$63,

scala.reflect.api.TypeTags.TypeTag<A7> evidence$64,

scala.reflect.api.TypeTags.TypeTag<A8> evidence$65,

scala.reflect.api.TypeTags.TypeTag<A9> evidence$66,

scala.reflect.api.TypeTags.TypeTag<A10> evidence$67). Parameters:

scala.Function10<A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$57. scala.reflect.api.TypeTags.TypeTag<A1> evidence$58. scala.reflect.api.TypeTags.TypeTag<A2> evidence$59. scala.reflect.api.TypeTags.TypeTag<A3> evidence$60. scala.reflect.api.TypeTags.TypeTag<A4> evidence$61. scala.reflect.api.TypeTags.TypeTag<A5> evidence$62. scala.reflect.api.TypeTags.TypeTag<A6> evidence$63. scala.reflect.api.TypeTags.TypeTag<A7> evidence$64. scala.reflect.api.TypeTags.TypeTag<A8> evidence$65. scala.reflect.api.TypeTags.TypeTag<A9> evidence$66. scala.reflect.api.TypeTags.TypeTag<A10> evidence$67.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.320 udf(scala.Function1<A1,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$3, scala.reflect.api.TypeTags.TypeTag<A1> evidence$4)

Defines a Scala closure of 1 argument as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function1<A1,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$3,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$4). Parameters:

scala.Function1<A1,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$3. scala.reflect.api.TypeTags.TypeTag<A1> evidence$4.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.
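
A minimal sketch (df and its column name are assumed); note that, unlike the DataType-based variants above, the types are inferred from the closure itself:

import org.apache.spark.sql.functions.{col, udf}

val df = ... // schema => name: StringType
val toUpper = udf((s: String) => s.toUpperCase)
df.select(toUpper(col("name")))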

1.3.321 udf(scala.Function2<A1,A2,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$5, scala.reflect.api.TypeTags.TypeTag<A1> evidence$6, scala.reflect.api.TypeTags.TypeTag<A2> evidence$7)

Defines a Scala closure of 2 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function2<A1,A2,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$5,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$6,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$7). Parameters:

scala.Function2<A1,A2,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$5. scala.reflect.api.TypeTags.TypeTag<A1> evidence$6. scala.reflect.api.TypeTags.TypeTag<A2> evidence$7.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.


1.3.322 udf(scala.Function3<A1,A2,A3,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$8, scala.reflect.api.TypeTags.TypeTag<A1> evidence$9, scala.reflect.api.TypeTags.TypeTag<A2> evidence$10, scala.reflect.api.TypeTags.TypeTag<A3> evidence$11)

Defines a Scala closure of 3 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function3<A1,A2,A3,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$8,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$9,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$10,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$11). Parameters:

scala.Function3<A1,A2,A3,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$8. scala.reflect.api.TypeTags.TypeTag<A1> evidence$9. scala.reflect.api.TypeTags.TypeTag<A2> evidence$10. scala.reflect.api.TypeTags.TypeTag<A3> evidence$11.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.323 udf(scala.Function4<A1,A2,A3,A4,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$12, scala.reflect.api.TypeTags.TypeTag<A1> evidence$13, scala.reflect.api.TypeTags.TypeTag<A2> evidence$14, scala.reflect.api.TypeTags.TypeTag<A3> evidence$15, scala.reflect.api.TypeTags.TypeTag<A4> evidence$16)

Defines a Scala closure of 4 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function4<A1,A2,A3,A4,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$12,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$13,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$14,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$15,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$16).


Parameters:

scala.Function4<A1,A2,A3,A4,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$12. scala.reflect.api.TypeTags.TypeTag<A1> evidence$13. scala.reflect.api.TypeTags.TypeTag<A2> evidence$14. scala.reflect.api.TypeTags.TypeTag<A3> evidence$15. scala.reflect.api.TypeTags.TypeTag<A4> evidence$16.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.324 udf(scala.Function5<A1,A2,A3,A4,A5,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$17, scala.reflect.api.TypeTags.TypeTag<A1> evidence$18, scala.reflect.api.TypeTags.TypeTag<A2> evidence$19, scala.reflect.api.TypeTags.TypeTag<A3> evidence$20, scala.reflect.api.TypeTags.TypeTag<A4> evidence$21, scala.reflect.api.TypeTags.TypeTag<A5> evidence$22)

Defines a Scala closure of 5 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function5<A1,A2,A3,A4,A5,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$17,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$18,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$19,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$20,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$21,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$22). Parameters:

scala.Function5<A1,A2,A3,A4,A5,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$17. scala.reflect.api.TypeTags.TypeTag<A1> evidence$18. scala.reflect.api.TypeTags.TypeTag<A2> evidence$19. scala.reflect.api.TypeTags.TypeTag<A3> evidence$20. scala.reflect.api.TypeTags.TypeTag<A4> evidence$21. scala.reflect.api.TypeTags.TypeTag<A5> evidence$22.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.


1.3.325 udf(scala.Function6<A1,A2,A3,A4,A5,A6,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$23, scala.reflect.api.TypeTags.TypeTag<A1> evidence$24, scala.reflect.api.TypeTags.TypeTag<A2> evidence$25, scala.reflect.api.TypeTags.TypeTag<A3> evidence$26, scala.reflect.api.TypeTags.TypeTag<A4> evidence$27, scala.reflect.api.TypeTags.TypeTag<A5> evidence$28, scala.reflect.api.TypeTags.TypeTag<A6> evidence$29)

Defines a Scala closure of 6 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function6<A1,A2,A3,A4,A5,A6,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$23,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$24,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$25,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$26,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$27,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$28,

scala.reflect.api.TypeTags.TypeTag<A6> evidence$29). Parameters:

scala.Function6<A1,A2,A3,A4,A5,A6,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$23. scala.reflect.api.TypeTags.TypeTag<A1> evidence$24. scala.reflect.api.TypeTags.TypeTag<A2> evidence$25. scala.reflect.api.TypeTags.TypeTag<A3> evidence$26. scala.reflect.api.TypeTags.TypeTag<A4> evidence$27. scala.reflect.api.TypeTags.TypeTag<A5> evidence$28. scala.reflect.api.TypeTags.TypeTag<A6> evidence$29.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.326 udf(scala.Function7<A1,A2,A3,A4,A5,A6,A7,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$30, scala.reflect.api.TypeTags.TypeTag<A1> evidence$31, scala.reflect.api.TypeTags.TypeTag<A2> evidence$32, scala.reflect.api.TypeTags.TypeTag<A3> evidence$33, scala.reflect.api.TypeTags.TypeTag<A4> evidence$34, scala.reflect.api.TypeTags.TypeTag<A5> evidence$35, scala.reflect.api.TypeTags.TypeTag<A6> evidence$36, scala.reflect.api.TypeTags.TypeTag<A7> evidence$37)

Defines a Scala closure of 7 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function7<A1,A2,A3,A4,A5,A6, A7,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$30,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$31,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$32,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$33,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$34,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$35,

scala.reflect.api.TypeTags.TypeTag<A6> evidence$36,

scala.reflect.api.TypeTags.TypeTag<A7> evidence$37). Parameters:

scala.Function7<A1,A2,A3,A4,A5,A6,A7,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$30. scala.reflect.api.TypeTags.TypeTag<A1> evidence$31. scala.reflect.api.TypeTags.TypeTag<A2> evidence$32. scala.reflect.api.TypeTags.TypeTag<A3> evidence$33. scala.reflect.api.TypeTags.TypeTag<A4> evidence$34. scala.reflect.api.TypeTags.TypeTag<A5> evidence$35. scala.reflect.api.TypeTags.TypeTag<A6> evidence$36. scala.reflect.api.TypeTags.TypeTag<A7> evidence$37.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.327 udf(scala.Function8<A1,A2,A3,A4,A5,A6,A7,A8,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$38, scala.reflect.api.TypeTags.TypeTag<A1> evidence$39, scala.reflect.api.TypeTags.TypeTag<A2> evidence$40, scala.reflect.api.TypeTags.TypeTag<A3> evidence$41, scala.reflect.api.TypeTags.TypeTag<A4> evidence$42, scala.reflect.api.TypeTags.TypeTag<A5> evidence$43, scala.reflect.api.TypeTags.TypeTag<A6> evidence$44, scala.reflect.api.TypeTags.TypeTag<A7> evidence$45, scala.reflect.api.TypeTags.TypeTag<A8> evidence$46)

Defines a Scala closure of 8 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().

Signature: UserDefinedFunction udf(scala.Function8<A1,A2,A3,A4,A5,A6,A7, A8,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$38,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$39,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$40,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$41,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$42,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$43,

scala.reflect.api.TypeTags.TypeTag<A6> evidence$44,

scala.reflect.api.TypeTags.TypeTag<A7> evidence$45,

scala.reflect.api.TypeTags.TypeTag<A8> evidence$46). Parameters:

scala.Function8<A1,A2,A3,A4,A5,A6,A7,A8,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$38. scala.reflect.api.TypeTags.TypeTag<A1> evidence$39. scala.reflect.api.TypeTags.TypeTag<A2> evidence$40. scala.reflect.api.TypeTags.TypeTag<A3> evidence$41. scala.reflect.api.TypeTags.TypeTag<A4> evidence$42. scala.reflect.api.TypeTags.TypeTag<A5> evidence$43. scala.reflect.api.TypeTags.TypeTag<A6> evidence$44. scala.reflect.api.TypeTags.TypeTag<A7> evidence$45. scala.reflect.api.TypeTags.TypeTag<A8> evidence$46.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.328 udf(scala.Function9<A1,A2,A3,A4,A5,A6,A7,A8,A9,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$47, scala.reflect.api.TypeTags.TypeTag<A1> evidence$48, scala.reflect.api.TypeTags.TypeTag<A2> evidence$49, scala.reflect.api.TypeTags.TypeTag<A3> evidence$50, scala.reflect.api.TypeTags.TypeTag<A4> evidence$51, scala.reflect.api.TypeTags.TypeTag<A5> evidence$52, scala.reflect.api.TypeTags.TypeTag<A6> evidence$53, scala.reflect.api.TypeTags.TypeTag<A7> evidence$54, scala.reflect.api.TypeTags.TypeTag<A8> evidence$55, scala.reflect.api.TypeTags.TypeTag<A9> evidence$56)

Defines a Scala closure of 9 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure’s signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().


Signature: UserDefinedFunction udf(scala.Function9<A1,A2,A3,A4,A5,A6,A7, A8,A9,RT> f, scala.reflect.api.TypeTags.TypeTag<RT> evidence$47,

scala.reflect.api.TypeTags.TypeTag<A1> evidence$48,

scala.reflect.api.TypeTags.TypeTag<A2> evidence$49,

scala.reflect.api.TypeTags.TypeTag<A3> evidence$50,

scala.reflect.api.TypeTags.TypeTag<A4> evidence$51,

scala.reflect.api.TypeTags.TypeTag<A5> evidence$52,

scala.reflect.api.TypeTags.TypeTag<A6> evidence$53,

scala.reflect.api.TypeTags.TypeTag<A7> evidence$54,

scala.reflect.api.TypeTags.TypeTag<A8> evidence$55,

scala.reflect.api.TypeTags.TypeTag<A9> evidence$56). Parameters:

scala.Function9<A1,A2,A3,A4,A5,A6,A7,A8,A9,RT> f. scala.reflect.api.TypeTags.TypeTag<RT> evidence$47. scala.reflect.api.TypeTags.TypeTag<A1> evidence$48. scala.reflect.api.TypeTags.TypeTag<A2> evidence$49. scala.reflect.api.TypeTags.TypeTag<A3> evidence$50. scala.reflect.api.TypeTags.TypeTag<A4> evidence$51. scala.reflect.api.TypeTags.TypeTag<A5> evidence$52. scala.reflect.api.TypeTags.TypeTag<A6> evidence$53. scala.reflect.api.TypeTags.TypeTag<A7> evidence$54. scala.reflect.api.TypeTags.TypeTag<A8> evidence$55. scala.reflect.api.TypeTags.TypeTag<A9> evidence$56.

Returns: UserDefinedFunction. Appeared in Apache Spark v1.3.0. This method is classified in: udf.

1.3.329 unbase64(Column e)

Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.

Signature: Column unbase64(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: digest.
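
A minimal round-trip sketch (df and its binary column payload are assumed):

import org.apache.spark.sql.functions.{base64, col, unbase64}

val df = ... // schema => payload: BinaryType
df.select(unbase64(base64(col("payload")))) // encodes to BASE64 text, then decodes back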

1.3.330 unboundedFollowing()

Window function: returns the special frame boundary that represents the last row in the window partition.

Signature: Column unboundedFollowing(). Returns: Column. Appeared in Apache Spark v2.3.0.

1.3.331 unboundedPreceding()

Window function: returns the special frame boundary that represents the first row in the window partition.

Signature: Column unboundedPreceding(). Returns: Column. Appeared in Apache Spark v2.3.0.
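
These Column-returning boundaries play the same role as the Long constants on org.apache.spark.sql.expressions.Window. A minimal running-total sketch using those constants (df and the columns dept, day, and amount are assumed):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum}

val df = ... // schema => dept: StringType, day: DateType, amount: DoubleType
// Frame from the partition's first row up to the current row
val w = Window.partitionBy(col("dept")).orderBy(col("day"))
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("runningTotal", sum(col("amount")).over(w))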

1.3.332 unhex(Column column)

Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the byte representation of the number.

Signature: Column unhex(Column column). Parameter: Column column. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: conversion.
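
A minimal sketch (df and its column hexString are assumed):

import org.apache.spark.sql.functions.{col, unhex}

val df = ... // schema => hexString: StringType
df.select(unhex(col("hexString"))) // "4D61" becomes the two bytes 0x4D 0x61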

1.3.333 unix_timestamp()

Returns the current Unix timestamp (in seconds). Signature: Column unix_timestamp(). Returns: Column. Appeared in Apache Spark v1.5.0. Note: All calls of unix_timestamp within the same query return the same value (that is, the current timestamp is calculated at the start of query evaluation). This method is classified in: datetime.

1.3.334 unix_timestamp(Column s)

Converts a time string in format yyyy-MM-dd HH:mm:ss to a Unix timestamp (in seconds), using the default timezone and the default locale. Returns null if it fails.

Signature: Column unix_timestamp(Column s). Parameter: Column s. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.

1.3.335 unix_timestamp(Column s, String p)

Converts a time string with the given pattern to a Unix timestamp (in seconds). Returns null if it fails.

Signature: Column unix_timestamp(Column s, String p).


Parameters:

Column s. String p.

Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
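
A minimal sketch covering the three variants (df and its string column ts are assumed):

import org.apache.spark.sql.functions.{col, unix_timestamp}

val df = ... // schema => ts: StringType
df.select(unix_timestamp())                        // now, in seconds since the epoch
df.select(unix_timestamp(col("ts")))               // expects yyyy-MM-dd HH:mm:ss
df.select(unix_timestamp(col("ts"), "yyyy/MM/dd")) // custom pattern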

1.3.336 upper(Column e)

Converts a string column to upper case. Signature: Column upper(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.3.0. This method is classified in: string.

1.3.337 var_pop(Column e)

Aggregate function: returns the population variance of the values in a group. Signature: Column var_pop(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.338 var_pop(String columnName)

Aggregate function: returns the population variance of the values in a group. Signature: Column var_pop(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.339 var_samp(Column e)

Aggregate function: returns the unbiased variance of the values in a group. Signature: Column var_samp(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.340 var_samp(String columnName)

Aggregate function: returns the unbiased variance of the values in a group. Signature: Column var_samp(String columnName).


Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.341 variance(Column e)

Aggregate function: alias for var_samp. Signature: Column variance(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.

1.3.342 variance(String columnName)

Aggregate function: alias for var_samp. Signature: Column variance(String columnName). Parameter: String columnName. Returns: Column. Appeared in Apache Spark v1.6.0. This method is classified in: aggregate and statistics.
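
A minimal sketch (df and its numeric column price are assumed):

import org.apache.spark.sql.functions.{var_pop, var_samp, variance}

val df = ... // schema => price: DoubleType
df.agg(var_pop("price"), var_samp("price"), variance("price")) // variance equals var_samp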

1.3.343 weekofyear(Column e)

Extracts the week number as an integer from a given date/timestamp/string. Signature: Column weekofyear(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.
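
A minimal sketch (df and its date column d are assumed):

import org.apache.spark.sql.functions.{col, weekofyear}

val df = ... // schema => d: DateType
df.select(weekofyear(col("d"))) // for example, 2019-01-15 yields 3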

1.3.344 when(Column condition, Object value)

Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.

// Example: encoding the gender string column into an integer.
// Scala:
people.select(when(people("gender") === "male", 0)
  .when(people("gender") === "female", 1)
  .otherwise(2))

// Java:
people.select(when(col("gender").equalTo("male"), 0)
  .when(col("gender").equalTo("female"), 1)
  .otherwise(2));

Signature: Column when(Column condition, Object value).


Parameters:

Column condition. Object value.

Returns: Column. Appeared in Apache Spark v1.4.0. This method is classified in: conditional.

1.3.345 window(Column timeColumn, String windowDuration)

Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window:

val df = ... // schema => time: TimestampType, stockId: StringType, price: DoubleType

df.groupBy(window($"time", "1 minute"), $"stockId") .agg(mean("price"))

The windows will look like:

09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...

For a streaming query, you may use the function current_timestamp to generate windows on processing time.

Signature: Column window(Column timeColumn, String windowDuration). Parameters:

Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.

String windowDuration A string specifying the width of the window, for example: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.346 window(Column timeColumn, String windowDuration, String slideDuration)

Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds:

val df = ... // schema => time: TimestampType, stockId: StringType, price: DoubleType

df.groupBy(window($"time", "1 minute", "10 seconds"), $"stockId") .agg(mean("price"))

The windows will look like:

09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...

For a streaming query, you may use the function current_timestamp to generate windows on processing time.

Signature: Column window(Column timeColumn, String windowDuration,

String slideDuration). Parameters:

Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.

String windowDuration A string specifying the width of the window, for example: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.

String slideDuration A string specifying the sliding interval of the window, for example: 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.347 window(Column timeColumn, String windowDuration, String slideDuration, String startTime)

Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, for example: 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The following example takes the average stock price for a one minute window every 10 seconds starting 5 seconds after the hour:


val df = ... // schema => time: TimestampType, stockId: StringType, price: DoubleType

df.groupBy(window($"time", "1 minute", "10 seconds", "5 seconds"), $"stockId")

.agg(mean("price"))

The windows will look like:

09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...

For a streaming query, you may use the function current_timestamp to generate windows on processing time.

Signature: Column window(Column timeColumn, String windowDuration,

String slideDuration, String startTime). Parameters:

Column timeColumn The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType.

String windowDuration A string specifying the width of the window, for example: 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.

String slideDuration A string specifying the sliding interval of the window, for example: 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.

String startTime The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, to have hourly tumbling windows that start 15 minutes past the hour, such as 12:15-13:15, 13:15-14:15, ..., provide startTime as 15 minutes.

Returns: Column. Appeared in Apache Spark v2.0.0. This method is classified in: streaming.

1.3.348 year(Column e)

Extracts the year as an integer from a given date/timestamp/string. Signature: Column year(Column e). Parameter: Column e. Returns: Column. Appeared in Apache Spark v1.5.0. This method is classified in: datetime.