Images and duplicates

Material for the ENSAE / BRGM hackathon, 2018. The images are extracted from tweets, but the same image is often reposted without being an actual retweet, which creates duplicates.

%matplotlib inline
import matplotlib.pyplot as plt
from jyquickhelper import add_notebook_menu
add_notebook_menu()

Separating duplicates

For the challenge, duplicates must be detected among the images. To do so, I rescale each image to a 50x50 grayscale square, then apply a PCA followed by k-nearest neighbors to detect the duplicates.
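
As a rough illustration of the thresholding idea, the sketch below flags near-duplicates by comparing the Euclidean distance between flattened grayscale vectors. It is a brute-force stand-in written for this page, not the notebook's pipeline: it skips the PCA step, and the function name find_near_duplicates, the toy vectors and the threshold value are all hypothetical.

```python
import math


def find_near_duplicates(vectors, threshold):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= threshold.

    Brute force in O(n^2): only meant to illustrate the thresholding idea.
    """
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if math.dist(vectors[i], vectors[j]) <= threshold:
                pairs.append((i, j))
    return pairs


# Toy "images": 4-pixel grayscale vectors (hypothetical values).
toy = [
    [0, 0, 255, 255],
    [0, 1, 255, 254],   # almost identical to the first one
    [200, 10, 30, 90],  # unrelated
]
near = find_near_duplicates(toy, threshold=10)
print(near)
```

On tens of thousands of images the quadratic loop becomes expensive, which is one reason to rely on a nearest-neighbors index as the notebook does.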

50x50 grayscale images

folder = "c:/temp/suricatenat_images"
from ensae_projects.hackathon.image_helper import apply_image_transform, image_zoom, img2gray

dest_folder = "img5050"
list(apply_image_transform(folder, dest_folder, lambda img: image_zoom(img2gray(img), (50, 50)), fLOG=print))
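
To make the transformation concrete, here is a minimal pure-Python sketch of a grayscale conversion followed by a downscale on nested lists. How img2gray and image_zoom are actually implemented is not shown in this notebook; the luminance weights (ITU-R 601) and the nearest-neighbor sampling below are assumptions chosen for illustration, and to_gray and resize_nearest are hypothetical names.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples into rows of luminance values.

    Uses the ITU-R 601 weights (an assumption, not taken from the notebook).
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in pixels]


def resize_nearest(gray, new_h, new_w):
    """Resize a 2D list of values with nearest-neighbor sampling."""
    h, w = len(gray), len(gray[0])
    return [[gray[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


# A 4x4 RGB image: white on the left half, black on the right half.
rgb = [[(255, 255, 255)] * 2 + [(0, 0, 0)] * 2 for _ in range(4)]
small = resize_nearest(to_gray(rgb), 2, 2)
print(small)
```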

Images as features

Not used in the rest of the notebook.

from ensae_projects.hackathon.image_helper import stream_image2features
import numpy

dest_folder = "img5050"
dest_batch = "batch"
for b in stream_image2features(dest_folder, dest_batch, numpy.array, fLOG=print):
    pass

Neighbors

%matplotlib inline
from ensae_projects.hackathon.image_knn import ImageNearestNeighbors
folder = "img5050"
knn = ImageNearestNeighbors()
knn.fit(folder, fLOG=print)
[ImageNearestNeighbors] processing image 0: 'inondation_2016735614357036519425_CjVtTTrUoAAUUZp.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 1000: 'inondation_2016737596119933321217_Cjx3w1FVAAAyyY1.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 2000: 'inondation_2016737891662077255685_Cj2EjjXWUAA8Dhq.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 3000: 'inondation_2016738050337521709056_Cj4UpFDUoAIR2gD.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 4000: 'inondation_2016738283056302313472_Cj7oe7VWkAAPwAT.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 5000: 'inondation_2016738366585526718464_Cj80fFNXEAAx9T2.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 6000: 'inondation_2016738439428159377408_Cj92vvAUYAARP2A.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 7000: 'inondation_2016738629637845221376_CkAjvUFVAAErbJF.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 8000: 'inondation_2016738695722296614912_CkBf1CbXIAAonp1.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 9000: 'inondation_2016738766013416787968_CkCfqR3XIAEuX8m.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 10000: 'inondation_2016738894521304526849_CkEUnRhW0AEh1e1.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 11000: 'inondation_2016739101985295728640_CkHRVZ-WUAAyCrp.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 12000: 'inondation_2016739400457899114496_CkLgzBAWkAE9hCa.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 13000: 'inondation_2016739732522427424768_CkQOztKWYAAlOul.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 14000: 'inondation_2016740054590863859712_CkUzuikWgAAJUTK.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 15000: 'inondation_2016740416207296299008_CkZ8nAnWYAAG7cC.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 16000: 'inondation_2016740833843914153985_Ckf4dIFWEAANSOT.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 17000: 'inondation_2016742361701924937728_Ck1mBsLWkAE6EQX.jpg' - class 'inondation_2016'
[ImageNearestNeighbors] processing image 18000: 'inondation_2018955391968712019968_DUI76ywW4AA2J1b.jpg' - class 'inondation_2018'
[ImageNearestNeighbors] processing image 19000: 'inondation_2018956216357934325761_LKmRQ9hLmVxOkWtm.jpg' - class 'inondation_2018'
[ImageNearestNeighbors] processing image 20000: 'inondation_2018957254473604268032_DUjZ2vSWkAAdzd2.jpg' - class 'inondation_2018'
[ImageNearestNeighbors] processing image 21000: 'inondation_2018959020320320565248_DU8fYlpX4AAZIRV.jpg' - class 'inondation_2018'
[ImageNearestNeighbors] processing image 22000: 'inondation_2018964034081381109761_DWDv4vHWsAAMQIS.jpg' - class 'inondation_2018'
[ImageNearestNeighbors] processing image 23000: 'seisme_Amatrice768290329543995392_MwkGcfSrCBzWbxwK.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 24000: 'seisme_Amatrice768326333034364928_CqmktbfXEAAw2RU.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 25000: 'seisme_Amatrice768345861646581760_Cqm2eUjWYAAWS68.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 26000: 'seisme_Amatrice768361403522646016_CqnEgFrWcAANqdO.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 27000: 'seisme_Amatrice768374709645967361_CqnQt96XEAAew2V.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 28000: 'seisme_Amatrice768387852862455810_CqncoxLWYAAulnb.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 29000: 'seisme_Amatrice768401257769865216_CqnlYItWAAAbc7p.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 30000: 'seisme_Amatrice768417849652027394_Cqnz5_gXgAAx967.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 31000: 'seisme_Amatrice768433724564377600_CqoGZC4WAAEh8zG.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 32000: 'seisme_Amatrice768451168372662272_CqoWQCGW8AQIbHV.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 33000: 'seisme_Amatrice768468307288743936_Cqol1cDXgAEychm.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 34000: 'seisme_Amatrice768488406091386880_Cqo4H6GWIAAr4YP.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 35000: 'seisme_Amatrice768511762429800448_CqpNXTxXYAATvNk.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 36000: 'seisme_Amatrice768543842845032448_CqplczAWIAAhINz.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 37000: 'seisme_Amatrice768647190260518912_CqrIhKnUkAAyvf6.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 38000: 'seisme_Amatrice768716815279063040_CqsH3mqUEAA6gXD.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 39000: 'seisme_Amatrice768743738634080256_CqsgWwgWcAE0rWO.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 40000: 'seisme_Amatrice768772807568351232_Cqs6ORfWIAAXlf8.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 41000: 'seisme_Amatrice768804543748575232_CqtXniSXYAAp7Tt.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 42000: 'seisme_Amatrice768843712357076993_Cqt7R_1WYAE6tr6.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 43000: 'seisme_Amatrice768901703898771456_Cquv7mKWgAAn6ZX.jpg' - class 'seisme_Amatrice'
[ImageNearestNeighbors] processing image 44000: 'suricatenat_inondation_aude1052220109740228608_Dpo8nOhXgAYLNEm.jpg' - class 'suricatenat_inondation_aude'
from ensae_projects.hackathon.image_helper import enumerate_image_class
folder = "img5050"
# Read at most 1,000,000 items from the generator ("it" avoids shadowing the built-in iter).
it = enumerate_image_class(folder)
imgs = [_[0] for _ in zip(it, range(0, 1000000))]
len(imgs)
44053
for i, img in enumerate(imgs):
    dist, ind = knn.kneighbors(img[0])
    if dist[0, 1] <= 10:
        print("dist =", dist)
        print("ind =", ind)
        break
dist = [[  0.           0.           7.93725393 366.16662874 380.73481585]]
ind = [[   12     3    10 21464  8684]]
knn.plot_neighbors(ind, dist, obs=img[0], folder_or_images=folder);
[Figure: neighbors plotted by knn.plot_neighbors, images_dups_14_0.png]
pairs = []
for i, img in enumerate(imgs):
    if i % 1000 == 0:
        print("{0}/{1} done".format(i, len(imgs)))
    dist, ind = knn.kneighbors(img[0])
    sub = ind.ravel()[dist.ravel() <= 10]
    if len(sub) > 0:
        for j in sub:
            pairs.append((i, j))
0/44053 done
1000/44053 done
2000/44053 done
3000/44053 done
4000/44053 done
5000/44053 done
6000/44053 done
7000/44053 done
8000/44053 done
9000/44053 done
10000/44053 done
11000/44053 done
12000/44053 done
13000/44053 done
14000/44053 done
15000/44053 done
16000/44053 done
17000/44053 done
18000/44053 done
19000/44053 done
20000/44053 done
21000/44053 done
22000/44053 done
23000/44053 done
24000/44053 done
25000/44053 done
26000/44053 done
27000/44053 done
28000/44053 done
29000/44053 done
30000/44053 done
31000/44053 done
32000/44053 done
33000/44053 done
34000/44053 done
35000/44053 done
36000/44053 done
37000/44053 done
38000/44053 done
39000/44053 done
40000/44053 done
41000/44053 done
42000/44053 done
43000/44053 done
44000/44053 done
pairs[:10]
[(0, 0),
 (1, 1),
 (2, 2),
 (3, 12),
 (3, 3),
 (3, 10),
 (4, 4),
 (5, 133),
 (5, 1549),
 (5, 158)]
pairs2 = [(i,j) for i,j in pairs if i != j]
len(pairs), len(pairs2)
(75725, 33675)
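
Because the loop visits every image, a duplicate pair can show up in both orders, (i, j) and (j, i). The sketch below, which is not part of the original notebook, keeps each unordered pair once; unordered_pairs and the toy list are hypothetical.

```python
def unordered_pairs(pairs):
    """Drop self-pairs and keep each unordered pair once, stored as (min, max)."""
    return sorted({(min(i, j), max(i, j)) for i, j in pairs if i != j})


# Toy list mimicking the shape of the pairs collected above.
toy_pairs = [(0, 0), (3, 12), (3, 3), (12, 3), (3, 10), (10, 3)]
unique = unordered_pairs(toy_pairs)
print(unique)
```

The connected-components step that follows does not need this deduplication, since processing a pair twice leaves the labels unchanged; it only makes the list easier to inspect.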
pairs2[:10]
[(3, 12),
 (3, 10),
 (5, 133),
 (5, 1549),
 (5, 158),
 (5, 5632),
 (5, 16784),
 (8, 14699),
 (8, 23),
 (8, 35)]
dist, ind = knn.kneighbors(imgs[5][0])
knn.plot_neighbors(ind, dist, obs=imgs[5][0], folder_or_images=folder);
[Figure: neighbors plotted by knn.plot_neighbors, images_dups_19_0.png]

Connected components

distincts = []
for i, j in pairs2:
    distincts.append(i)
    distincts.append(j)
distincts = set(distincts)
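# Every image starts in its own component, labelled with its own index.
connex = {}
for k in distincts:
    connex[k] = k

# Propagate the smallest label along each pair.
# A pass that reports 0 modifications means the labels have converged.
n = 0
while n < 10:
    modif = 0
    for i, j in pairs2:
        a = min(connex[i], connex[j])
        if a != connex[i] or a != connex[j]:
            modif += 1
        connex[i] = connex[j] = a
    print(n, modif)
    n += 1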
0 9096
1 6
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
len(connex), len(set(connex.values()))
(13271, 4185)
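
The loop above propagates the smallest label for a fixed number of passes and relies on the printed counts to check convergence. A union-find structure is a common alternative that needs a single pass over the pairs. The sketch below is not the notebook's method; connected_components and the toy pairs are hypothetical, and it assumes the same convention of labelling each component with its smallest index.

```python
def connected_components(pairs):
    """Map every node appearing in pairs to the smallest index of its component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smallest index as the representative of the merged set.
            if ri < rj:
                parent[rj] = ri
            else:
                parent[ri] = rj
    return {node: find(node) for node in list(parent)}


toy_pairs = [(3, 12), (3, 10), (5, 133), (133, 158), (8, 23)]
labels = connected_components(toy_pairs)
# Every node that is not its own representative counts as a duplicate.
duplicates = sorted(k for k, v in labels.items() if k != v)
print(len(labels), len(set(labels.values())), duplicates)
```

With this convention, one image per component is kept (the one with the smallest index) and the others are treated as duplicates, which matches how dups is built from connex in the next cells.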
names = knn.image_names_
names[:2]
['inondation_2016/735614357036519425_CjVtTTrUoAAUUZp.jpg',
 'inondation_2016/735616090261184512_CjVu73ZVEAAlWmu.jpg']
dups = []
for i, j in connex.items():
    if i != j:
        dups.append(names[i])
len(dups)
9086

Very close images

for i, img in enumerate(imgs):
    dist, ind = knn.kneighbors(img[0])
    if 10 < dist[0, 1] <= 30:
        print("dist =", dist)
        print("ind =", ind)
        break
dist = [[  0.          21.97726098  21.97726098  21.97726098 161.13348504]]
ind = [[ 285  308  351  311 3005]]
obs = imgs[ind[0, 0]][0]
knn.plot_neighbors(ind, dist, obs=obs, folder_or_images=folder);
[Figure: neighbors plotted by knn.plot_neighbors, images_dups_27_0.png]

Copying the dataset

not_allowed = set(dups)
len(not_allowed)
9086
list(sorted(not_allowed))[:5]
['inondation_2016/735805396657397762_CjYbG-DUgAQTu19.jpg',
 'inondation_2016/735829559329853440_CjYxFcrXEAAvjlH.jpg',
 'inondation_2016/735870604038045696_CjZWafAXEAA3sOb.jpg',
 'inondation_2016/735892072960512000_CjZp8CoWsAIOhL5.jpg',
 'inondation_2016/735892650583306240_CjZqdvoXAAEaSRM.jpg']
from ensae_projects.hackathon.image_helper import stream_copy_images

src_folder = "c:/temp/suricatenat_images/"
dest_folder = "c:/temp/suricatenat_clean/"

def valid(name):
    # Keep an image only if its relative path is not in the duplicate set.
    spl = name.split("suricatenat_images")[-1].replace("\\", "/").strip("/\\")
    return spl not in not_allowed

for img in stream_copy_images(src_folder, dest_folder, valid, fLOG=print):
    pass
[stream_copy_images] copy image 0: 'bing01-9.jpg' - class 'bing'
[stream_copy_images] copy image 1000: 'imagenet13271012508_955158b073.jpg' - class 'imagenet1'
[stream_copy_images] copy image 2000: 'imagenet23287016043_987800dc67.jpg' - class 'imagenet2'
[stream_copy_images] copy image 3000: 'imagenet4106994_5349_big_200907_voyager11.jpg' - class 'imagenet4'
[stream_copy_images] copy image 4000: 'imagenet5532346050_dafb11ec86.jpg' - class 'imagenet5'
[stream_copy_images] copy image 5000: 'inondation_2016736966968138473472_Cjo7jTrXAAAeffo.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 6000: 'inondation_2016737629970399252480_CjySiGiUkAUr8TC.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 7000: 'inondation_2016737923554407256064_Cj2hYNwWUAElsOP.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 8000: 'inondation_2016738072076880347136_Cj4opLHXEAAuuGK.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 9000: 'inondation_2016738298504267730945_Cj72k1kUoAAIKUA.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 10000: 'inondation_2016738378724442296321_Cj8_iRUXEAEIYex.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 11000: 'inondation_2016738456441082793984_Cj-GNbPWkAAecmj.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 12000: 'inondation_2016738642491671379968_CkAvbhyVAAQdhnl.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 13000: 'inondation_2016738708144893927424_CkBrBsFXIAAesMt.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 14000: 'inondation_2016738775822753013760_CkCosKRXEAAL3QS.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 15000: 'inondation_2016738983572388913152_CkFlodbW0AAjH1A.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 16000: 'inondation_2016739133036877467649_CkHtiX3XEAAQ5qt.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 17000: 'inondation_2016739435820709519360_CkMA9WNXAAEBBwW.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 18000: 'inondation_2016739759634534141958_CkQnd1TUUAQli3i.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 19000: 'inondation_2016740101248225935361_CkVVPYDWUAAc8U3.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 20000: 'inondation_2016740462147130556416_CkamZeeXAAIf6ru.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 21000: 'inondation_2016740924772062769152_CkhLHExW0AIpwYC.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 22000: 'inondation_2016742979124050964480_Ck-XkQfXEAE46Wh.jpg' - class 'inondation_2016'
[stream_copy_images] copy image 23000: 'inondation_2018955500762070769664_DUKe4P3WAAEBJFC.jpg' - class 'inondation_2018'
[stream_copy_images] copy image 24000: 'inondation_2018956447069216165890_DUX7giCXUAANfkI.jpg' - class 'inondation_2018'
[stream_copy_images] copy image 25000: 'inondation_2018957555126931279872_DUnrT9aXUAARFxJ.jpg' - class 'inondation_2018'
[stream_copy_images] copy image 26000: 'inondation_2018959394452564598784_DVB0KQsWkAA4Bta.jpg' - class 'inondation_2018'
[stream_copy_images] copy image 27000: 'inondation_2018965549350599487488_DWZSA7cWsAEWGaK.jpg' - class 'inondation_2018'
[stream_copy_images] copy image 28000: 'seisme_Amatrice768296828550819841_CqmJ4k4UsAEcTaF.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 29000: 'seisme_Amatrice768330792049205248_CqmooXdXgAAJ2b3.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 30000: 'seisme_Amatrice768348574694408192_Cqm4itvWcAAsv_s.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 31000: 'seisme_Amatrice768363756728516608_CqnGo90WIAAfA1I.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 32000: 'seisme_Amatrice768376884677738496_CqnOR17WIAAP2Hn.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 33000: 'seisme_Amatrice768390411228422144_Cqne_UWWYAAnY6V.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 34000: 'seisme_Amatrice768404063755141120_CqnrVy8XYAAYIGO.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 35000: 'seisme_Amatrice768420565745106944_Cqn6bjbWIAEAbck.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 36000: 'seisme_Amatrice768436635444908032_CqoI-OfWIAEpe5T.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 37000: 'seisme_Amatrice768453842098880512_CqoYsPEXEAARM5o.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 38000: 'seisme_Amatrice768471447140458496_CqoosvJW8AIjpEA.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 39000: 'seisme_Amatrice768492129882517506_Cqo7hBpW8AA64OU.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 40000: 'seisme_Amatrice768516668515577856_CqpR0mDWIAAEGkL.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 41000: 'seisme_Amatrice768550981206441984_Cqpw-qOWAAAVGTB.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 42000: 'seisme_Amatrice768679088013778944_CqrlaXlVUAAcqVG.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 43000: 'seisme_Amatrice768721000015892480_CqsLrL6UkAAofwM.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 44000: 'seisme_Amatrice768749206500741120_CqslU7hWEAA_Cyn.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 45000: 'seisme_Amatrice768777504609931264_Cqs_DzcWAAAZ2_h.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 46000: 'seisme_Amatrice768810730250461184_CqtdRvNWAAAE_HJ.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 47000: 'seisme_Amatrice768850688487022592_Cqsp-GyWgAEJ6Kp.jpg' - class 'seisme_Amatrice'
[stream_copy_images] copy image 48000: 'seisme_Amatrice768916332322648064_Cqu9OPSWgAEzaux.jpg' - class 'seisme_Amatrice'
l1 = list(enumerate_image_class("c:/temp/suricatenat_images/"))
l2 = list(enumerate_image_class("c:/temp/suricatenat_clean/"))
len(l1), len(l2)
(48884, 39798)

Taking a random sample

from ensae_projects.hackathon.image_helper import stream_random_sample, last_element
rnd = last_element(stream_random_sample("c:/temp/suricatenat_clean/", abspath=False))
rnd[:5]
[('imagenet2\2611787731_6b65bdaf6a.jpg', 'imagenet2'),
 ('inondation_2016\740608740169224192_CkcruUEXIAEsWUl.jpg',
  'inondation_2016'),
 ('inondation_2016\738614580658606080_CkAWBegUgAA5Z9l.jpg',
  'inondation_2016'),
 ('inondation_2018\956548703552245760_DUZX5TRWsAAyDqH.jpg',
  'inondation_2018'),
 ('inondation_2018\956925376936148993_DUeuiGQX4AAocq-.jpg',
  'inondation_2018')]
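
How stream_random_sample selects its items is not shown in this notebook. One common way to draw a fixed-size uniform sample from a stream of unknown length is reservoir sampling, sketched below purely as an assumption; reservoir_sample, its parameters and the fake file names are hypothetical.

```python
import random


def reservoir_sample(stream, size, seed=None):
    """Return `size` items drawn uniformly from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream):
        if n < size:
            # Fill the reservoir with the first `size` items.
            sample.append(item)
        else:
            # Item n replaces a reservoir slot with probability size / (n + 1).
            k = rng.randint(0, n)
            if k < size:
                sample[k] = item
    return sample


names_stream = ("img_{0}.jpg".format(i) for i in range(1000))
picked = reservoir_sample(names_stream, size=5, seed=0)
print(picked)
```

The stream is read only once and only the reservoir is kept in memory, which suits a folder with tens of thousands of images.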
import os
import shutil

src_folder = "c:/temp/suricatenat_clean/"
dest_folder = "c:/temp/suricatenat_sample/"


for img, sub in rnd:
    src = os.path.join(src_folder, img)
    dst = os.path.join(dest_folder, img)
    d = os.path.dirname(dst)
    # Create the class subfolder if it does not exist yet.
    os.makedirs(d, exist_ok=True)

    shutil.copy(src, dst)