How to build your own Google

  • Published on
    18-Feb-2017

Transcript

How to build your own Google ...
artur.grzadziel@gmail.com
Data Wizards, Dec 2015

Artur Grządziel: a few words about me
- Email: artur.grzadziel@gmail.com
- Currently: BigData and Machine Learning Leader
- From Jan 2016: BigData Solution Architect at General Electric
- PhD in progress at PAN (Polish Academy of Sciences), Systems Research Institute
- Graduated from Warsaw University of Technology and Warsaw School of Economics
- BigData & Machine Learning enthusiast focused on leveraging Big Data and Machine Learning in real business cases
- Privately, husband and father
- pl.linkedin.com/in/ArturGrzadziel

Introduction: Data Wizards
Artur represents the Data Wizards group, an informal group of BigData/Machine Learning/Data Science professionals located in Poland who are interested in sharing knowledge and addressing business challenges with modern BigData and Machine Learning methods.

Agenda
1. Cloudera search
2. How it works?

MySearch: very high level architecture
Data Source, Index

Cloudera search: Apache Solr and Tika
1. Other Sources

Cloudera Search
Cloudera Search is one of Cloudera's near-real-time access products. Cloudera Search enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple, full-text interface for searching. Cloudera Search incorporates Apache Solr, which includes Apache Lucene, SolrCloud, Apache Tika, and Solr Cell. Cloudera Search is tightly integrated with Cloudera's Distribution Including Apache Hadoop (CDH).
Cloudera Search provides these key capabilities:
- Near-real-time indexing
- Batch indexing
- Simple, full-text data exploration and navigated drill down
http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html

Cloudera search: Tika
https://tika.apache.org/download.html

Cloudera search: Tika, image
Cloudera search: Tika, PDF file
Cloudera search: Tika, gazeta.pl

Cloudera search: Tika, formats
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image formats
- Video formats
- Java class files and archives
- The mbox format
https://tika.apache.org/1.4/formats.html

Cloudera search: Solr, how to start it
.\bin\solr start -e cloud -noprompt
http://lucene.apache.org/solr/

Cloudera Search: Administration

Cloudera Search: Data

id          cat   name                   price  inStock  author              series_t                             sequence_i  genre_s
553573403   book  A Game of Thrones      7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               1           fantasy
553579908   book  A Clash of Kings       7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               2           fantasy
055357342X  book  A Storm of Swords      7.99   TRUE     George R.R. Martin  A Song of Ice and Fire               3           fantasy
553293354   book  Foundation             7.99   TRUE     Isaac Asimov        Foundation Novels                    1           scifi
812521390   book  The Black Company      6.99   FALSE    Glen Cook           The Chronicles of The Black Company  1           fantasy
812550706   book  Ender's Game           6.99   TRUE     Orson Scott Card    Ender                                1           scifi
441385532   book  Jhereg                 7.95   FALSE    Steven Brust        Vlad Taltos                          1           fantasy
380014300   book  Nine Princes In Amber  6.99   TRUE     Roger Zelazny       the Chronicles of Amber              1           fantasy
805080481   book  The Book of Three      5.99   TRUE     Lloyd Alexander     The Chronicles of Prydain            1           fantasy
080508049X  book  The Black Cauldron     5.99   TRUE     Lloyd Alexander     The Chronicles of Prydain            2           fantasy

Cloudera Search: Output format
Cloudera Search: Simple query
Cloudera Search: Simple query
Cloudera Search: More advanced query
Cloudera Search: Query with facets

Cloudera search: Solr, other features
- The MoreLikeThis search component enables users to query for documents similar to a document in their result list. It works by using terms from the original document to find similar documents in the index.
- The SpellCheck component is designed to provide inline query suggestions based on other, similar terms.
- Highlighting in Solr allows fragments of documents that match the user's query to be included with the query response.
- Synonyms, stop words

Cloudera search: Solr, other features, geospatial search
Solr has sophisticated geospatial support, including searching within a specified distance range of a given location (or within a bounding box), sorting by distance, or even boosting results by distance.
http://lucene.apache.org/solr/quickstart.html

Cloudera Search: Common Use Cases
Cloudera Search lets your entire business explore and analyze data quickly and easily for a variety of critical use cases, all within a single platform, including:
- Threat detection
- Customer 360-degree visibility
- Improved user experience
- Interactive market segmentation
- Accessible global knowledge base
https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html

Cloudera Search: Other Use Cases
- Instagram: Instagram (a Facebook company) is a well-known site that uses Solr to power its geosearch API.
- WhiteHouse.gov: The Obama administration's website is built on Drupal and Solr.
- Netflix: Solr powers basic movie searching on this extremely busy site.
- StubHub.com: This ticket reseller uses Solr to help visitors search for concerts and sporting events.
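The query slides earlier in the deck (simple query, more advanced query, query with facets) survive only as titles, so what follows is a minimal sketch of what such requests might look like, not the queries shown in the talk. It only builds request URLs and does not contact a server. The host, port, and the collection name "gettingstarted" are assumptions based on the Solr cloud example; the field names come from the book data table, except "store", which is a hypothetical location field used to illustrate the geospatial filter.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumptions (not from the slides): a local Solr on the default port 8983
# and a collection named "gettingstarted".
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def solr_url(params):
    """Return a select URL for a list of (name, value) parameter pairs."""
    return SOLR_SELECT + "?" + urlencode(params)


# Simple query: full-text search for one word, JSON output.
simple = solr_url([("q", "foundation"), ("wt", "json")])

# More advanced query: fielded search, restricted field list, sorted by price.
advanced = solr_url([
    ("q", "genre_s:fantasy AND inStock:true"),
    ("fl", "id,name,price"),
    ("sort", "price asc"),
    ("wt", "json"),
])

# Query with facets: count matching books per genre, return no documents.
faceted = solr_url([
    ("q", "*:*"),
    ("rows", "0"),
    ("facet", "true"),
    ("facet.field", "genre_s"),
    ("wt", "json"),
])

# Geospatial filter: documents within 5 km of a point.
# "store" is a hypothetical location field; the book data has none.
geo = solr_url([
    ("q", "*:*"),
    ("fq", "{!geofilt sfield=store pt=45.15,-93.85 d=5}"),
    ("wt", "json"),
])

for url in (simple, advanced, faceted, geo):
    print(url)

# Decode one URL again to show the parameters Solr would receive.
print(parse_qs(urlsplit(faceted).query))
```

Building the parameters as a list of pairs keeps repeated parameters possible (for example several facet.field entries), which a plain dictionary would not allow.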
https://www.safaribooksonline.com/library/view/scaling-apache-solr/9781783981748/ch01s05.html
http://whitehouse.gov/
http://stubhub.com/

How it works ... ?

How it works? Data Source: documents

Document  Content
1         John has a cat
2         John has a dog
3         Eva has a cat
4         George has a dog

How it works? Data Source: documents, space of unique terms

List of unique words:
1. John
2. has
3. a
4. cat
5. dog
6. Eva
7. George

Document  Content           Term numbers
1         John has a cat    1 2 3 4
2         John has a dog    1 2 3 5
3         Eva has a cat     6 2 3 4
4         George has a dog  7 2 3 5

How it works? Data Source: documents, boolean search with inverted index

Dictionary          Documents
Term    Tot. freq.  Doc #
John    2           1 2
has     4           1 2 3 4
a       4           1 2 3 4
cat     2           1 3
dog     2           2 4
Eva     1           3
George  1           4

How it works? Data Source: documents as vectors

Documents
document 1  John has a cat
document 2  John has a dog
document 3  Eva has a cat
document 4  George has a dog

Space of unique terms ->     John  has  a  cat  dog  Eva  George
vector representing doc1 ->  1     1    1  1    0    0    0
vector representing doc2 ->  1     1    1  0    1    0    0
vector representing doc3 ->  0     1    1  1    0    1    0
vector representing doc4 ->  0     1    1  0    1    0    1

How it works? Data Source: documents, vectors

Summary
1. Other Sources

Thank you
Data Wizards
E-mail: artur.grzadziel@gmail.com

Links:
- Cloudera Search: http://www.cloudera.com/content/www/en-us/documentation/archive/search/1-3-0/Cloudera-Search-User-Guide/csug_introducing.html
- Tika: https://tika.apache.org/
- Apache Solr: http://lucene.apache.org/solr/ and https://www.cloudera.com/content/www/en-us/products/apache-hadoop/apache-solr.html
- Vectors, Inverted Index, Frequency Matrix, etc.: http://courses.ischool.berkeley.edu/i202/f05/LectureNotes/202-20051108.htm
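The four-document example from the "How it works?" slides can be reproduced in a few lines. The sketch below is a toy illustration and not how Solr or Lucene implement indexing. It builds the list of unique terms, the inverted index, a boolean AND search, and the binary document vectors. The function name search_all is made up for this example, and words are matched case-sensitively to keep it short.

```python
from collections import defaultdict

# The four example documents from the slides.
docs = {
    1: "John has a cat",
    2: "John has a dog",
    3: "Eva has a cat",
    4: "George has a dog",
}

# Unique terms in order of first appearance, and the inverted index:
# term -> list of ids of the documents that contain it (the postings).
terms = []
index = defaultdict(list)
for doc_id, text in docs.items():
    for word in text.split():
        if word not in index:
            terms.append(word)
        if doc_id not in index[word]:
            index[word].append(doc_id)


def search_all(*words):
    """Boolean AND: ids of the documents that contain every given word."""
    postings = [set(index.get(word, [])) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


# Binary vectors: one position per unique term, 1 if the document has it.
vectors = {
    doc_id: [1 if term in text.split() else 0 for term in terms]
    for doc_id, text in docs.items()
}

print(terms)
for term in terms:
    # No word repeats inside a document here, so the number of postings
    # equals the total frequency shown on the slide.
    print(term, len(index[term]), index[term])
print(search_all("has", "dog"))
print(vectors[3])
```

With this data, search_all("has", "dog") returns documents 2 and 4, and the vector for document 3 matches the row "0 1 1 1 0 1 0" from the slide.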