A whirlwind tour of graph databases

  • Published on
    18-Mar-2018

DESCRIPTION

GraphDB Whirlwind Tour Michael Hunger Code Days - OOP (Michael Hunger)-[:WORKS_FOR]->(Neo4j) michael@neo4j.com | @mesirii | github.com/jexp | jexp.de/blog Michael Hunger…

Transcript

GraphDB Whirlwind Tour
Michael Hunger
Code Days - OOP

(Michael Hunger)-[:WORKS_FOR]->(Neo4j)
michael@neo4j.com | @mesirii | github.com/jexp | jexp.de/blog
Michael Hunger - Head of Developer Relations @Neo4j

• Why Graphs?
• Use Cases
• Data Model
• Querying
• Neo4j

Why Graphs?
Because the World is a Graph!

Everything and Everyone is Connected
• people, places, events
• companies, markets
• countries, history, politics
• sciences, art, teaching
• technology, networks, machines, applications, users
• software, code, dependencies, architecture, deployments
• criminals, fraudsters and their behavior

Value from Relationships
Value from Data Relationships

Common Use Cases
Internal Applications
• Master Data Management
• Network and IT Operations
• Fraud Detection
Customer-Facing Applications
• Real-Time Recommendations
• Graph-Based Search
• Identity and Access Management

The Rise of Connections in Data
• Networks of People: e.g., Employees, Customers, Suppliers, Partners, Influencers
• Business Processes: e.g., Risk management, Supply chain, Payments
• Knowledge Networks: e.g., Enterprise content, Domain specific content, eCommerce content
Data connections are increasing as rapidly as data volumes

Harnessing Connections Drives Business Value
• Enhanced Decision Making
• Hyper Personalization
• Massive Data Integration
• Data Driven Discovery & Innovation
Examples: Product Recommendations, Personalized Health Care, Media and Advertising, Fraud Prevention, Network Analysis, Law Enforcement, Drug Discovery, Intelligence and Crime Detection, Product & Process Innovation, 360 view of customer, Compliance, Optimize Operations

Connected Data at the Center
• AI & Machine Learning
• Price optimization
• Product Recommendations
• Resource allocation
• Digital Transformation Megatrends

Graph Databases Are Hot

Lots of Choice
Newcomers in the last 3 years
• DSE Graph
• Agens Graph
• IBM Graph
• JanusGraph
• Tibco GraphDB
• Microsoft CosmosDB
• TigerGraph
• MemGraph
• AWS Neptune
• SAP HANA Graph

Database Technology Architectures
[Diagram labels: Graph DB; Connected Data; Discrete Data; Relational DBMS; Other NoSQL]
Right Tool for the Job

The Impact of Graphs
How Graphs are changing the World

GRAPHS FOR GOOD
Neo4j ICIJ Distribution

Better Health with Graphs
Cancer Research - Candiolo Cancer Institute
“Our application relies on complex hierarchical data, which required a more flexible model than the one provided by the traditional relational database model,” said Andrea Bertotti, MD
neo4j.com/case-studies/candiolo-cancer-institute-ircc/

Graph Databases in Healthcare and Life Sciences
14 Presenters from all around Europe on:
• Genome
• Proteome
• Human Pathway
• Reactome
• SNP
• Drug Discovery
• Metabolic Symbols
• ...
neo4j.com/blog/neo4j-life-sciences-healthcare-workshop-berlin/

DISRUPTION WITH GRAPHS

BETTER BUSINESS WITH GRAPHS

Common Graph Technology Use Cases
• Real-Time Recommendations
• Fraud Detection
• Network & IT Operations
• Master Data Management
• Knowledge Graph
• Identity & Access Management

AirBnb

Handling Large Graph Work Loads for Enterprises
Real-time promotion recommendations
• Record “Cyber Monday” sales
• About 35M daily transactions
• Each transaction is 3-22 hops
• Queries executed in 4ms or less
• Replaced IBM Websphere commerce
Marriott’s Real-time Pricing Engine
• 300M pricing operations per day
• 10x transaction throughput on half the hardware compared to Oracle
• Replaced Oracle database
Handling Package Routing in Real-Time
• Large postal service with over 500k employees
• Neo4j routes 7M+ packages daily at peak, with peaks of 5,000+ routing operations per second.

Software | Financial Services | Telecom | Retail & Consumer Goods | Media & Entertainment | Other Industries
Airbus

NEW INSIGHTS WITH GRAPHS
Machine Learning is Based on Graphs

The Property Graph Model, Import, Query

The Whiteboard Model Is the Physical Model
Eliminates Graph-to-Relational Mapping
• In your data model: bridge the gap between business and IT models
• In your application: greatly reduce need for application code
[Diagram: CAR node; properties shown: name: “Dan”, born: May 29, 1970, twitter: “@dan”; name: “Ann”, born: Dec 5, 1975; since: Jan 10, 2011; brand: “Volvo”, model: “V70”]

Property Graph Model Components
Nodes
• The objects in the graph
• Can have name-value properties
• Can be labeled
Relationships
• Relate nodes by type and direction
• Can have name-value properties
[Diagram: PERSON nodes connected by LOVES, LOVES and LIVES WITH relationships]

Cypher: Powerful and Expressive Query Language
    MATCH (:Person { name:"Dan"} ) -[:LOVES]-> (:Person { name:"Ann"} )
[Diagram: Dan -LOVES-> Ann, annotated NODE, LABEL, PROPERTY]

Relational Versus Graph Models
[Diagram: Relational Model with Person, Friend and Person-Friend tables; Graph Model with ANDREAS, TOBIAS, MICA and DELIA connected by KNOWS]

Retail ... Recommendations

Our starting point – Northwind ER

Building Relationships in Graphs
[Diagram: Customer -ORDERED-> Order]

Locate Foreign Keys
(FKs)-[:BECOME]->(Relationships) & Correct Directions
Drop Foreign Keys
Find the Join Tables
Simple Join Tables Become Relationships
Attributed Join Tables Become Relationships with Properties
(One) Northwind Graph Model

(:You)-[:QUERY]->(:Data) in a graph

Who bought Chocolat?
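Before turning to query languages, the question can be read as a plain traversal: start at the Product node, follow INCLUDES relationships backwards to orders, then ORDERED relationships backwards to customers. The Python sketch below illustrates that idea on a toy in-memory property graph. The relationship types ORDERED and INCLUDES come from the Northwind graph model in the talk; the node ids, company names and quantities are made up for illustration, and the dictionaries are not how Neo4j stores data.

```python
from collections import defaultdict

# Hypothetical toy data: nodes carry a label and name-value properties.
NODES = {
    "c1": {"label": "Customer", "companyName": "Alpha Foods"},
    "c2": {"label": "Customer", "companyName": "Beta Market"},
    "o1": {"label": "Order"},
    "o2": {"label": "Order"},
    "o3": {"label": "Order"},
    "p1": {"label": "Product", "productName": "Chocolat"},
    "p2": {"label": "Product", "productName": "Tea"},
}

# Relationships have a start node, a type, an end node and their own properties.
# INCLUDES plays the role of the attributed join table (order details).
RELATIONSHIPS = [
    ("c1", "ORDERED", "o1", {}),
    ("c1", "ORDERED", "o2", {}),
    ("c2", "ORDERED", "o3", {}),
    ("o1", "INCLUDES", "p1", {"quantity": 2}),
    ("o2", "INCLUDES", "p1", {"quantity": 1}),
    ("o3", "INCLUDES", "p2", {"quantity": 5}),
]

# Index relationships by (end node, type) so they can be followed backwards.
INCOMING = defaultdict(list)
for start, rel_type, end, _props in RELATIONSHIPS:
    INCOMING[(end, rel_type)].append(start)


def who_bought(product_name):
    """Return the distinct company names of customers who ordered the product."""
    buyers = set()
    for node_id, props in NODES.items():
        if props["label"] != "Product" or props.get("productName") != product_name:
            continue
        for order_id in INCOMING[(node_id, "INCLUDES")]:
            for customer_id in INCOMING[(order_id, "ORDERED")]:
                buyers.add(NODES[customer_id]["companyName"])
    return sorted(buyers)


print(who_bought("Chocolat"))
```

Each hop is a lookup from a node to its neighbours rather than a join over whole tables, which is the point the slides make about relationships being part of the model.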
You all know SQL
    SELECT DISTINCT c.CompanyName
    FROM customers AS c
    JOIN orders AS o ON (c.CustomerID = o.CustomerID)
    JOIN order_details AS od ON (o.OrderID = od.OrderID)
    JOIN products AS p ON (od.ProductID = p.ProductID)
    WHERE p.ProductName = 'Chocolat'

Apache Tinkerpop 3.3.x - Gremlin
    g = graph.traversal();
    g.V().hasLabel('Product')
     .has('productName','Chocolat')
     .in('INCLUDES')
     .in('ORDERED')
     .values('companyName').dedup();

W3C Sparql
[The IRIs in angle brackets are missing from the transcript.]
    PREFIX sales_db:
    SELECT distinct ?company_name WHERE {
      ?company_name .
      ?c ?o .
      ?o ?od .
      ?od ?p .
      ?p "Chocolat" .
    }

openCypher
    MATCH (c:Customer)-[:ORDERED]->(o)
          -[:INCLUDES]->(p:Product)
    WHERE p.productName = 'Chocolat'
    RETURN distinct c.companyName

Basic Pattern: Customer's Orders?
    MATCH (:Customer {custName:"Delicatessen"} )
          -[:ORDERED]-> (order:Order)
    RETURN order
[Diagram: Customer -ORDERED-> Order, annotated VAR, LABEL, NODE, PROPERTY, REL]

Basic Query: Customer's Orders?
    MATCH (c:Customer)-[:ORDERED]->(order)
    WHERE c.customerName = 'Delicatessen'
    RETURN *

Basic Query: Customer's Frequent Purchases?
    MATCH (c:Customer)-[:ORDERED]->
          ()-[:INCLUDES]->(p:Product)
    WHERE c.customerName = 'Delicatessen'
    RETURN p.productName, count(*) AS freq
    ORDER BY freq DESC LIMIT 10;

openCypher - Recommendation
    MATCH (c:Customer)-[:ORDERED]->(o1)-[:INCLUDES]->(p),
          (peer)-[:ORDERED]->(o2)-[:INCLUDES]->(p),
          (peer)-[:ORDERED]->(o3)-[:INCLUDES]->(reco)
    WHERE c.customerId = $customerId
      AND NOT (c)-[:ORDERED]->()-[:INCLUDES]->(reco)
    RETURN reco.productName, count(*) AS freq
    ORDER BY freq DESC LIMIT 10

Product Cross-Sell
[The pattern between the two Product nodes is missing from the transcript.]
    MATCH (:Product {productName: 'Chocolat'}) ... (cross:Product)
    RETURN employee.firstName, cross.productName,
           count(distinct o2) AS freq
    ORDER BY freq DESC LIMIT 5;

openCypher
openCypher is a community effort to evolve Cypher, and to make it the most useful language for querying property graphs.
openCypher implementations: SAP Hana Graph, Redis, Agens Graph, Cypher.PL, Neo4j
github.com/opencypher

Language Artifacts
● Cypher 9 specification
● ANTLR and EBNF Grammars
● Formal Semantics (SIGMOD)
● TCK (Cucumber test suite)
● Style Guide
Implementations & Code
● openCypher for Apache Spark
● openCypher for Gremlin
● open source frontend (parser)
● ...

Cypher 10
● Next version of Cypher
● Actively working on natural language specification
● New features
  ○ Subqueries
  ○ Multiple graphs
  ○ Path patterns
  ○ Configurable pattern matching semantics

Extending Neo4j

Extending Neo4j - User Defined Procedures & Functions
[Diagram: Applications, Bolt, Neo4j Execution Engine, User Defined Procedure, User Defined Functions]
User Defined Procedures & Functions let you write custom code that is:
• Written in any JVM language
• Deployed to the Database
• Accessed by applications via Cypher

Procedure Examples
Built-In
• Metadata Information
• Index Management
• Security
• Cluster Information
• Query Listing & Cancellation
• ...
Libraries
• APOC (std library)
• Spatial
• RDF (neosemantics)
• NLP
• ...
neo4j.com/developer/procedures-functions

Example: Data(base) Integration

Graph Analytics
Neo4j Graph Algorithms

The Impact of Connected Data
“Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions”

Existing Options (so far)
• Data Processing
  • Spark with GraphX, Flink with Gelly
  • Gremlin Graph Computer
• Dedicated Graph Processing
  • Urika, GraphLab, Giraph, Mosaic, GPS, Signal-Collect, Gradoop
• Data Scientist Toolkit
  • igraph, NetworkX, Boost in Python, R, C

Goal: Iterate Quickly
• Combine data from sources into one graph
• Project to relevant subgraphs
• Enrich data with algorithms
• Traverse, collect, filter, aggregate with queries
• Visualize, Explore, Decide, Export
• From all APIs and Tools

Usage
1. Call as Cypher procedure
2. Pass in specification (Label, Prop, Query) and configuration
3. ~.stream variant returns (a lot of) results
       CALL algo.<name>.stream('Label','TYPE',{conf})
       YIELD nodeId, score
4. non-stream variant writes results to the graph and returns statistics
       CALL algo.<name>('Label','TYPE',{conf})

Cypher Projection
Pass in Cypher statements for node- and relationship-lists.
    CALL algo.<name>(
      'MATCH ... RETURN id(n)',
      'MATCH (n)-->(m) RETURN id(n) as source, id(m) as target',
      {graph:'cypher'})

DEMO: OOP

Neo4j Fits into Your Environment
[Diagram labels: Development; Data Storage and Business Rules Execution; Data Mining and Aggregation; Application; Graph Database Cluster; Neo4j; Ad Hoc Analysis; Bulk Analytic Infrastructure; Graph Compute Engine; EDW; …; Data Scientist; End User; Databases; Relational; NoSQL; Hadoop]

Official Language Drivers
• Foundational drivers for popular programming languages
• Bolt: streaming binary wire protocol
• Authoritative mapping to native type system, uniform across drivers
• Pluggable into richer frameworks
JavaScript, Java, .NET, Python, PHP, ....
[Diagram labels: Drivers; Bolt]

Bolt + Official Language Drivers
http://neo4j.com/developer/
http://neo4j.com/developer/language-guides/

Using Bolt: Official Language Drivers look all the same
With JavaScript
    // neo4j is the neo4j-driver module (require('neo4j-driver').v1 on 1.x drivers)
    var driver = neo4j.driver("bolt://localhost");
    var session = driver.session();
    var result = session.run("MATCH (u:User) RETURN u.name");

Spring Data Neo4j
neo4j.com/developer/spring-data-neo4j
Neo4j OGM
    @NodeEntity
    public class Talk {
        @Id @GeneratedValue
        Long id;
        String title;
        Slot slot;
        Track track;
        @Relationship(type="PRESENTS", direction=INCOMING)
        Set speaker = new HashSet();
    }

Spring Data Neo4j
Neo4j OGM
[The rest of this listing is missing from the transcript.]
    interface TalkRepository extends Neo4jRepository {
        @Query("MATCH (t:Talk)

Neo4j Spark Connector
github.com/neo4j-contrib/neo4j-spark-connector
https://github.com/jexp/neo4j-spark-connector

Neo4j JDBC Driver
github.com/neo4j-contrib/neo4j-jdbc
https://github.com/jexp/neo4j-spark-connector

Neo4j: THE Graph Database Platform
[Diagram labels: Graph Transactions; Graph Analytics; Data Integration; Development & Admin; Analytics Tooling; Drivers & APIs; Discovery & Visualization; Developers; Admins; Applications; Business Users; Data Analysts; Data Scientists]

Neo4j: Graph Platform
Real-time Transactional and Analytic Processing
• Operational workloads
• Analytics workloads
Discovery and Visualization
• Interactive graph exploration
• Graph representation of data
Agility
• Native property graph model
• Dynamic schema
Developer Productivity
• Cypher - Declarative query language
• Procedural language extensions
• Worldwide developer community
Hardware efficiency
• 10x less CPU with index-free adjacency
• 10x less hardware than other platforms
Performance
• Index-free adjacency
• Millions of hops per second

Native Graph Architecture
Index-free adjacency ensures lightning-fast retrieval of data and relationships
Index-free adjacency: unlike other database models, Neo4j connects data as it is stored

Neo4j Query Planner
Cost based Query Planner since Neo4j
• Uses transactional database statistics
• High performance Query Engine
• Bytecode compiled queries
• Future: Parallelism

Architecture Components
• Index-Free Adjacency: in memory and on flash/disk
• ACID Foundation: required for safe writes
• Full-Stack Clustering: causal consistency, security
• Language, Drivers, Tooling: developer experience, graph efficiency
• Graph Engine: cost-based optimizer, graph statistics, Cypher runtime
• Hardware Optimizations: for next-gen infrastructure

Neo4j – allows you to connect the dots
• Was built to efficiently
  • store,
  • query and
  • manage highly connected data
• Transactional, ACID
• Real-time OLTP
• Open source
• Highly scalable on few machines

High Query Performance: Some Numbers
• Traverse 2-4M+ relationships per second and core
• Cost based query optimizer – complex queries return in milliseconds
• Import 100K-1M records per second transactionally
• Bulk import tens of billions of records in a few hours

Get Started
Neo4j Sandbox

How do I get it?
Desktop – Container – Cloud
http://neo4j.com/download/
    docker run neo4j

Neo4j Cluster Deployment Options
• Developer: Neo4j Desktop (free Enterprise License)
• On premise – Standalone or via OS package
• Containerized with official Docker Image
• In the Cloud
  • AWS, GCE, Azure
• Using Resource Managers
  • DC/OS – Marathon
  • Kubernetes
  • Docker Swarm

Active Community
• 10M+ Downloads: 3M+ from Neo4j Distribution, 7M+ from Docker
• 400+ Events: approximate number of Neo4j events per year
• 50k+ Meetups: number of Meetup members globally
• 50k+ Trained Developers: trained/certified Neo4j professionals

Summary: Graphs allow you to ...
• Keep your rich data model
• Handle relationships efficiently
• Write queries easily
• Develop applications quickly
• Have fun

Thank You!
Questions?!
@neo4j | neo4j.com
@mesirii | Michael Hunger

Users Love Neo4j

Causal Clustering
• Core & Replica Servers
• Causal Consistency

Causal Clustering - Features
• Two Zones – Core + Edge
• Group of Core Servers – Consistent and Partition tolerant (CP)
• Transactional Writes
• Quorum Writes, Cluster Membership, Leader via Raft Consensus
• Scale out with Read Replicas
• Smart Bolt Drivers with
  • Routing, Read & Write Sessions
  • Causal Consistency with Bookmarks

Replica
• For massive query throughput
• Read-only replicas
• Not involved in Consensus Commit
• Disposable, suitable for auto-scaling
Core
• Small group of Neo4j databases
• Fault-tolerant Consensus Commit
• Responsible for data safety

Writing to the Core Cluster – Raft Consensus Commits
[Diagram: Application Server with Neo4j Driver writing to the Neo4j Cluster; ✓ ✓ ✓ Success; users Max, Jim, Jane, Mark]

Routed write statements
    // Neo4j Java driver 1.x (org.neo4j.driver.v1.*); parameters() is Values.parameters
    Driver driver = GraphDatabase.driver( "bolt+routing://aCoreServer" );
    try ( Session session = driver.session( AccessMode.WRITE ) )
    {
        try ( Transaction tx = session.beginTransaction() )
        {
            tx.run( "MERGE (user:User {userId: {userId}})",
                    parameters( "userId", userId ) );
            tx.success();
        }
    }

Bookmark
• Session token
• String (for portability)
• Opaque to application
• Represents ultimate user’s most recent view of the graph
• More capabilities to come

3.0: Data Redundancy, Massive Throughput, High Availability
3.1 Causal Clustering: Bigger Clusters, Consensus Commit, Built-in load balancing

Neo4j 3.0: High Availability Cluster
• Master-Slave architecture
• Paxos consensus used for master election
• Transaction committed once written durably on the master
• Practical deployments: 10s servers
Neo4j 3.1: Causal Cluster
• Two part cluster: writeable Core and read-only read replicas
• Raft protocol used for leader election, membership changes and commitment of all transactions
• Transaction committed once written durably on a majority of the core members
• Practical deployments: 100s servers

Flexible Authentication Options
Choose authentication method
• Built-in native users repository: testing/POC, single-instance deployments
• LDAP connector to Active Directory or openLDAP: production deployments
• Custom auth provider plugins: special deployment scenarios
[Diagram: Neo4j with Built-in Native Users, LDAP connector (Active Directory, openLDAP) and Auth Plugin Extension Module (Custom Plugin)]

Flexible Authentication Options: LDAP Group to Role Mapping
./conf/neo4j.conf
    dbms.security.ldap.authorization.group_to_role_mapping= \
        "CN=Neo4j Read Only,OU=groups,DC=example,DC=com" = reader; \
        "CN=Neo4j Read-Write,OU=groups,DC=example,DC=com" = publisher; \
        "CN=Neo4j Schema Manager,OU=groups,DC=example,DC=com" = architect; \
        "CN=Neo4j Administrator,OU=groups,DC=example,DC=com" = admin; \
        "CN=Neo4j Procedures,OU=groups,DC=example,DC=com" = allowed_role
[Diagram: BASE DN DC=example, DC=com; OU=people with CN=Bob Smith and CN=Carl Junior; OU=groups with CN=Neo4j Read Only, CN=Neo4j Read-Write, CN=Neo4j Schema Manager, CN=Neo4j Administrator and CN=Neo4j Procedures; groups map to Neo4j permissions]

Use Cases

Case Study: Knowledge Graphs at eBay
[Diagram labels: Bags; Men’s Backpack; Handbag]

Case study: Solving real-time recommendations for the World’s largest retailer

Challenge
• In its drive to provide the best web experience for its customers, Walmart wanted to optimize its online recommendations.
• Walmart recognized the challenge it faced in delivering recommendations with traditional relational database technology.

Use of Neo4j
• Walmart uses Neo4j to quickly query customers’ past purchases, as well as instantly capture any new interests shown in the customers’ current online visit – essential for making real-time recommendations.

Result/Outcome
• With Neo4j, Walmart could substitute a heavy batch process with a simple and real-time graph database.

“As the current market leader in graph databases, and with enterprise features for scalability and availability, Neo4j is the right choice to meet our demands.” - Marcos Vada, Walmart

Case study: eBay Now Tackles eCommerce Delivery Service Routing with Neo4j

Challenge
• The queries used to select the best courier for eBay’s routing system were simply taking too long and they needed a solution to maintain a competitive service.
• The MySQL joins being used created a code base too slow and complex to maintain.

Use of Neo4j
• eBay is now using Neo4j’s graph database platform to redefine e-commerce, by making delivery of online and mobile orders quick and convenient.

Result/Outcome
• With Neo4j eBay managed to eliminate the biggest roadblock between retailers and online shoppers: the option to have your item delivered the same day.
• The schema-flexible nature of the database allowed easy extensibility, speeding up development.
• The Neo4j solution was more than 1000x faster than the prior MySQL solution.

“Our Neo4j solution is literally thousands of times faster than the prior MySQL solution, with queries that require 10-100 times less code.” – Volker Pacher, eBay

Top Tier US Retailer
Case study: Solving real-time promotions for a top US retailer

Challenge
• Suffered significant revenue loss, due to legacy infrastructure.
• Particularly challenging when handling transaction volumes on peak shopping occasions such as Thanksgiving and Cyber Monday.

Use of Neo4j
• Neo4j is used to revolutionize and reinvent its real-time promotions engine.
• On average, Neo4j processes 90% of this retailer’s 35M+ daily transactions, each 3-22 hops, in 4ms or less.

Result/Outcome
• Reached an all-time high in online revenues, due to the Neo4j-based friction-free solution.
• Neo4j also enabled the company to be one of the first retailers to provide the same promotions across both online and traditional retail channels.

“On an average Neo4j processes 90% of this retailer’s 35M+ daily transactions, each 3-22 hops, in 4ms or less.” – Top Tier US Retailer

Relational DBs Can’t Handle Relationships Well
• Cannot model or store data and relationships without complexity
• Performance degrades with number and levels of relationships, and database size
• Query complexity grows with need for JOINs
• Adding new types of data and relationships requires schema redesign, increasing time to market
… making traditional databases inappropriate when data relationships are valuable in real-time
Slow development | Poor performance | Low scalability | Hard to maintain

Unlocking Value from Your Data Relationships
• Model your data as a graph of data and relationships
• Use relationship information in real-time to transform your business
• Add new relationships on the fly to adapt to your changing business

Express Complex Queries Easily with Cypher
Find all direct reports and how many people they manage, up to 3 levels down
Cypher Query
    MATCH (sub)-[:REPORTS_TO*0..3]->(boss),
          (report)-[:REPORTS_TO*1..3]->(sub)
    WHERE boss.name = "Andrew K."
    RETURN sub.name AS Subordinate, count(report) AS Total
SQL Query
[Not included in the transcript.]
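The variable-length patterns in the REPORTS_TO query (`*0..3` and `*1..3`) bound how many hops a match may take. The Python sketch below illustrates those bounds on a small in-memory reporting tree. The names other than "Andrew K." and the shape of the tree are made up for illustration, and the sketch assumes each employee has at most one manager, so counting reports is the same as counting matched paths.

```python
# Hypothetical reporting tree: employee -> manager (one REPORTS_TO per employee).
REPORTS_TO = {
    "Nancy": "Andrew K.",
    "Janet": "Andrew K.",
    "Steven": "Andrew K.",
    "Michael": "Steven",
    "Robert": "Steven",
    "Anne": "Robert",
}


def reachable(person, min_hops, max_hops):
    """People reached from `person` by following REPORTS_TO for min..max hops.

    Zero hops means the person themselves, mirroring the *0.. lower bound.
    """
    found = []
    current, hops = person, 0
    while hops <= max_hops:
        if hops >= min_hops:
            found.append(current)
        if current not in REPORTS_TO:
            break  # top of the hierarchy
        current = REPORTS_TO[current]
        hops += 1
    return found


def team_sizes(boss):
    """For everyone within 0..3 hops below `boss`, count reports within 1..3 hops."""
    people = set(REPORTS_TO) | set(REPORTS_TO.values())
    totals = {}
    for sub in people:
        if boss not in reachable(sub, 0, 3):
            continue
        total = sum(1 for report in people if sub in reachable(report, 1, 3))
        if total:  # a plain MATCH drops people who have no reports at all
            totals[sub] = total
    return totals


print(team_sizes("Andrew K."))
```

Because the second pattern is a plain MATCH rather than an OPTIONAL MATCH, people without any reports do not appear in the result, which the `if total` check mirrors.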