Bash in_array madness

So you want to check whether a value is in an array in Bash? Well, then you have more than one problem, or in other words: the fun begins. :)

The following is partly a script that can be executed and partly a post. Probably I should use something like Rmarkdown for bash-scripts or post the snippets as Gists, but I am going to leave it like that for now.

#!/bin/bash

# ----------------------------------------------------------
# These are switches you want to set anyway
# ----------------------------------------------------------

# Exit on error
set -e

# No uninitialized variables
set -u

# ----------------------------------------------------------
# The problem: check if a value is in an array
# or: find the needle in the haystack
# <https://stackoverflow.com/questions/3685970/check-if-an-array-contains-a-value#3689445>
# ----------------------------------------------------------

# The haystack
HAYSTACK=(one two three)
# The needle
NEEDLE=one

# ----------------------------------------------------------
# Way 1: case in solution
# <http://stackoverflow.com/a/3689445>
# ----------------------------------------------------------

# Given the fact that case is used to implement complex conditionals
# this top stackoverflow answer seems weird:

case "${HAYSTACK[@]}" in *"$NEEDLE"*) echo "1a) found";; esac

# Why? Because of the way case works.

# Remember that case matches EXPR against the patterns CASE1) to CASEN) and
# executes the commands CMD of the first pattern that matches, and only of
# that one, so
# - the order of CASE1 … CASEN is very important.
# - we are doing pattern matching, not comparing for equality.

# Syntax:
# case EXPR in CASE1) CMD;; … CASEN) CMD;; esac

# Concrete example:
# case "f" in "abc") echo "abc";; "f") echo "f";; esac

# The top answer turns this around: the whole HAYSTACK, joined into a single
# string, is the EXPR, and NEEDLE is made into a globbing pattern that matches
# any string that contains NEEDLE as a substring.

# That means it will match this:

NEEDLE=o
case "${HAYSTACK[@]}" in *"$NEEDLE"*) echo "1b) falsely found";; esac

# which is certainly not what you want.


# ----------------------------------------------------------
# Way 2: regex matching
# <http://stackoverflow.com/a/3686056>
# ----------------------------------------------------------

# Let us consider the following two additional arrays:
NEEDLE=one
SINGLETON=(one)
SPACES=(one "two two" three "four")

# Spaces have been a long-standing problem for bash;
# in fact, if I have the time I would like to write more about bash quirks :)
# The first thing is: we are again using regular expression matching in a weird
# way: <STRING> =~ <ERE>
# The HAYSTACK is the <STRING> and NEEDLE is the <ERE>. Quoting the right-hand
# side of =~ is usually not good practice, because a quoted pattern is matched
# as a literal string and not as a regex (thank you, shellcheck). Here a
# literal match is what we want.

if [[ " ${HAYSTACK[@]} " =~ " ${NEEDLE} " ]]; then echo "2a) found"; fi

# Which looks good. But let's check for substrings:
NEEDLE=o

if [[ " ${HAYSTACK[@]} " =~ " ${NEEDLE} " ]]; then echo "2a1) falsely found"; fi

# This prints nothing, which looks good as well.
# But let's check for substrings that contain the input field separator (IFS):

NEEDLE=two
if [[ " ${SPACES[@]} " =~ " ${NEEDLE} " ]]; then echo "2b) falsely found"; fi
# This should not be the case

# but the other properties look fine (nothing is printed here):
NEEDLE=o
if [[ " ${SPACES[@]} " =~ " ${NEEDLE} " ]]; then echo "2c) falsely found"; fi
if [[ " ${SINGLETON[@]} " =~ " ${NEEDLE} " ]]; then echo "2d) falsely found"; fi


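# So this solution requires a separator other than the space. One option is to
# join the array with a tab by setting IFS and expanding with [*] (which
# breaks again as soon as a value contains the new separator).
# Note that IFS does not change how an array literal is parsed, and an unquoted
# \t is just the letter t, so HAYSTACK=(one\ttwo two\tthree) would create the
# two elements "onettwo" and "twotthree".

TAB=$'\t'
HAYSTACK=(one "two two" three)
IFS=$TAB
JOINED="${TAB}${HAYSTACK[*]}${TAB}"
unset IFS

NEEDLE=two
# this is the old behaviour
if [[ " ${SPACES[@]} " =~ " ${NEEDLE} " ]]; then echo "2e) falsely found"; fi
# and this is the new behaviour: nothing is printed
if [[ "$JOINED" =~ "${TAB}${NEEDLE}${TAB}" ]]; then echo "2f) falsely found"; fi

# and an element that contains a space is found now:
NEEDLE="two two"
if [[ "$JOINED" =~ "${TAB}${NEEDLE}${TAB}" ]]; then echo "2g) found"; fi

# But every check now needs the IFS dance and a separator that never occurs in
# the values, so at this point I am giving up on this path.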

# ----------------------------------------------------------
# Way 3: for loop in a function (lots of sources, since this is the obvious
# way in procedural programming)
#
# <http://stackoverflow.com/a/3686262>
# <http://stackoverflow.com/a/19072965>
# ----------------------------------------------------------

# The sane thing should be a loop.
# But then we would need to know what we are comparing.
# -eq is for numbers and == for strings. A version for strings could be:

function failing_in_array(){
    local THIS_ARRAY=$1
    local THIS_VALUE=$2
    #printf "%s" "$THIS_ARRAY"
    for i in "${THIS_ARRAY[@]}"
    do
        #printf "%s" "$i"
        if [ "$i" == "$THIS_VALUE" ] ; then
            printf "y\n"
            return 0
        fi
    done
    printf "n\n"
    return 0
    #return 1
}

# INPUT
HAYSTACK=(one two three)
NEEDLE=one
SINGLETON=(one)
SPACES=(one "two two" three "four")

if [[ $(failing_in_array "$HAYSTACK" "$NEEDLE") == "y" ]]; then printf "3a) found\n"; fi

NEEDLE=two
if [[ $(failing_in_array "$HAYSTACK" "$NEEDLE") == "y" ]]; then printf "3b) found\n"; else printf "3b) falsely not found\n"; fi

# So why does this not work for the last case?
# There are a lot of things going wrong here:

# - first: you cannot pass an array as a single argument in bash
# - second: THIS_ARRAY is not an array. It is a string, so the loop runs
#   exactly once.
# - third: $HAYSTACK is not the array, but only the first element of the
#   array (it is the same as ${HAYSTACK[0]}).

# (at this point I feel like a carpenter who is only allowed to use broken tools)

# What about the first point?
# There are two ways of doing this: pass the values (the expanded array) or
# pass the array by name. Passing the values has the disadvantage that the
# function is unable to distinguish the last element of the array from the
# second parameter.

# For the function:
# in_array (a, b, c) d
# is equivalent to
# in_array (a, b, c, d)
# because both lists of arguments evaluate to
# a, b, c, d

# So the function only gets a list of values but does not know how it was
# called. So, can we live with that? I think yes, I can, and it is better
# than messing around with the IFS.
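
# Passing by name avoids that ambiguity. A minimal sketch, assuming Bash 4.3 or
# newer for "local -n" (namerefs); in_array_by_name and FRUITS are hypothetical
# names that only serve this example. The caller passes the name of the array,
# not its values, so the needle is always the second parameter.

function in_array_by_name(){
    # _HAYSTACK_REF becomes an alias for the array whose name is in $1
    local -n _HAYSTACK_REF=$1
    local _NEEDLE=$2
    local _ITEM
    for _ITEM in "${_HAYSTACK_REF[@]}"
    do
        if [ "$_ITEM" == "$_NEEDLE" ] ; then
            printf "y\n"
            return 0
        fi
    done
    printf "n\n"
    return 0
}

FRUITS=(apple "blood orange" cherry)
if [[ $(in_array_by_name FRUITS "blood orange") == "y" ]]; then printf "3c) found\n"; fi
if [[ $(in_array_by_name FRUITS orange) == "y" ]]; then printf "3d) falsely found\n"; fi

# The catch: the caller's array must not be called _HAYSTACK_REF itself, or the
# nameref would point at itself.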


# And lastly, why do we have to use a weird output value like y and n?
# Because if the function returned 1 and was called on a line of its own, the
# script would stop because we are using set -e (and we won't unset this
# because a function that requires that would be … crappy). That is always the
# case with -e and a bare call that returns something other than 0; set -e only
# ignores the status inside a condition such as if, while, && or ||. Printing
# y or n works no matter how the function is called. So, let's accept that too.
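
# A nuance worth checking here: set -e ignores a non-zero status when the
# command is tested by if, while, && or ||. A minimal sketch, with the
# hypothetical helper status_in_array and the made-up array COLORS, that
# answers through its exit status. The needle comes first here, which is a
# design choice of this sketch, so that the rest of the arguments is the array.

function status_in_array(){
    # usage: status_in_array NEEDLE ELEMENT...
    local _NEEDLE=$1
    shift
    local _ITEM
    for _ITEM in "$@"
    do
        if [ "$_ITEM" == "$_NEEDLE" ] ; then
            return 0
        fi
    done
    return 1
}

COLORS=(red "dark green" blue)
if status_in_array "dark green" "${COLORS[@]}"; then printf "3e) found\n"; fi
if status_in_array green "${COLORS[@]}"; then printf "3f) falsely found\n"; else printf "3f) not found\n"; fi

# Only a bare call such as
#   status_in_array green "${COLORS[@]}"
# on a line of its own would stop the script under set -e.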

# so in the end our in_array function would be like this:

function in_array(){
    local ARGS=("$@")
    # printf "ALL ARGS: %s\n" "${ARGS[@]}"
    # everything but the last argument is the array
    local THIS_ARRAY=("${ARGS[@]:0:((${#ARGS[@]} - 1))}")
    # printf "ARRAY: %s\n" "${THIS_ARRAY[@]}"
    # the last argument is the value to look for
    local THIS_VALUE="${ARGS[*]:((${#ARGS[@]} - 1)):1}"
    # printf "VALUE: %s\n" "$THIS_VALUE"
    # keep the loop variable out of the global scope
    local i
    for i in "${THIS_ARRAY[@]}"
    do
        # printf "i: %s\n" "$i"
        if [ "$i" == "$THIS_VALUE" ] ; then
            printf "y\n"
            return 0
        fi
    done
    printf "n\n"
    return 0
    #return 1
}

# INPUT
HAYSTACK=(one two three)
NEEDLE=one
SINGLETON=(one)
SPACES=(one "two two" three "four")

if [[ $(in_array "${HAYSTACK[@]}" "$NEEDLE") == "y" ]]; then printf "4a) found\n"; else printf "4a) falsely not found\n"; fi

# in_array "${HAYSTACK[@]}" "$NEEDLE"

NEEDLE=two

if [[ $(in_array "${HAYSTACK[@]}" "$NEEDLE") == "y" ]]; then printf "4b) found\n"; else printf "4b) falsely not found\n"; fi

# in_array "${HAYSTACK[@]}" "$NEEDLE"

NEEDLE=two
if [[ $(in_array "${SPACES[@]}" "$NEEDLE") == "y" ]]; then printf "4c) falsely found\n"; else printf "4c) not found\n"; fi

# so this looks good. does it break on spaces?
NEEDLE="two two"
if [[ $(in_array "${SPACES[@]}" "$NEEDLE") == "y" ]]; then printf "4d) found\n"; else printf "4d) falsely not found\n"; fi

if [[ $(in_array "${HAYSTACK[@]}" "$NEEDLE") == "y" ]]; then printf "4e) falsely found\n"; else printf "4e) not found\n"; fi
# no. good. and on numbers?

NEEDLE=2
HAYSTACK=(one 2 three)
if [[ $(in_array "${HAYSTACK[@]}" "$NEEDLE") == "y" ]]; then printf "4f) found\n"; else printf "4f) falsely not found\n"; fi
# looks like the perfect thing :)

# This function does not differentiate between strings and non-strings. So
# "2" == 2 is true, but that is the default behaviour of bash and zsh: the
# shell removes the quotes before the function ever sees the value.

NEEDLE=2
HAYSTACK=(one "2" three)
if [[ $(in_array "${HAYSTACK[@]}" "$NEEDLE") == "y" ]]; then printf "4g) falsely found\n"; else printf "4g) not found\n"; fi


# Optimizations

# Note that you could check beforehand whether it is worth traversing the array
# by using grep at the beginning, and you could also time the different
# solutions. Ugh, let's leave it like this for today.


# other solutions I did not look at

# use an associative array (declare -A)
# <http://stackoverflow.com/a/14550606>
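# A minimal sketch of a lookup with declare -A (an associative array), assuming
# Bash 4.0 or newer; SEEN, LOOKUP and the keys are made up for this example and
# not taken from the linked answer. The values become keys, so a lookup needs
# no loop. The ":-" keeps set -u quiet when the key does not exist.
declare -A SEEN=([one]=1 ["two two"]=1 [three]=1)
LOOKUP="two two"
if [[ -n "${SEEN[$LOOKUP]:-}" ]]; then printf "5a) found\n"; fi
LOOKUP=two
if [[ -n "${SEEN[$LOOKUP]:-}" ]]; then printf "5b) falsely found\n"; else printf "5b) not found\n"; fi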

# use grep
# <http://stackoverflow.com/a/5086746>
# inarray=$(echo ${haystack[@]} | grep -o "needle" | wc -w)
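
# A stricter variant of the grep idea, as a sketch: print one element per line
# and let grep compare whole lines (-x) as fixed strings (-F) without output
# (-q). It assumes that no element contains a newline; GREP_HAYSTACK is a
# made-up name for this example.
GREP_HAYSTACK=(one "two two" three)
if printf '%s\n' "${GREP_HAYSTACK[@]}" | grep -Fxq -- "two two"; then printf "6a) found\n"; fi
if printf '%s\n' "${GREP_HAYSTACK[@]}" | grep -Fxq -- "two"; then printf "6b) falsely found\n"; else printf "6b) not found\n"; fi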

Summary

Due to the way bash handles arrays, we went on a bigger and longer journey than we should have had to. Is this really productive? I like to look at programming and scripting languages in depth, so we'll see; maybe there will be similar posts about other languages. If you know a language I should write about, write to me and I'll consider it. I like to complain too. :)