Welcome!

Machine Learning Authors: Ed Featherston, Liz McMillan, Elizabeth White, Dan Blacharski, William Schmarzo

Blog Feed Post

Lync 2013 High Availability deep dive: Architecture

With the release of Lync 2013 the story of high availability has changed significantly. Since these changes contain some new technologies I think it would be best to break this article out into multiple parts. For this first part, we will discuss the Architecture and that will be followed by the end user experience. There is a lot to discuss so let's get started.

Windows Fabric

One of the major changes is how Lync provides highly available services with a pool. In Lync 2010 this responsibility was handled by Cluster Manager. With Lync 2013 we now leverage the Windows Fabric to provide high availability to these services. The Windows Fabric services allows Lync to increase our scale for both On-Premise and Online deployments. Windows Fabric will handle the following:

  • Election of Primary, Secondary, and Backup Secondary replica
  • Failover Management including Fast Failover with full service
  • Replication between primary and secondary replicas
  • Manages and Monitors which servers are functional and participating in pool operations
  • Services for MCU Factory, Conference Directory, Routing Group, LYSS
  • Load Balancing within a pool

Data Replication within the Pool

  1. Persistent User Data – synchronous replication to all replicas
    1. Presence, Contacts and Groups, and User Voice settings
    2. Minimal Conference data including the Active Conference Roster and MCUs
    3. Transient User Data – not replicated across all replicas
      1. Presence changes due to user activity including Calendar, Inactivity, and Phone Calls.

The fabric will also lazily write this data to the backend end databases. Doing this lazily means when there are free cycles we will use them to replicate this data instead of impacting the server when its busy doing other replication to front ends. The backend data can be used for both rehydration of front end servers (restarts) and disaster recovery (pool failovers and failbacks).

The following ports must be opened between the Lync front end servers within a pool for the fabric replication process to be successful.

  • Federation (TCP/5090) - Windows Fabric ring communication.
  • Lease Agent (TCP/5091)
  • Client Connection (TCP/5092) - These are not user clients but Lync Services that register to Windows Fabric.
  • Runtime Service (TCP/5093)
  • Replication (TCP/5094)

 

Quorum

In order to function the Windows Fabric needs to achieve a status of Quorum. This means a certain number of servers (Voters) must be able to communicate with each other to ensure that the pool is capable of maintaining consistency (Quorum). There are 2 different pool states in which we must maintain\achieve quorum which differ in how many voters must be available.

  1. Cold Pool Startup\Quorum Lost – when a pool is first starting or after quorum has been lost, that pool must have roughly 85% of its servers available to achieve quorum1.
  2. Operating Pool – after a pool has been started we need at least 50% of the servers available to maintain quorum2.

1 This means that when defining a pool within topology builder only add the servers that will be deployed at that time, not in the future. If you put 6 servers in a pool within topology builder and only deploy 3, we will never achieve quorum because we don’t have approximitley 85% of the servers operational. Thus, the front end services will never start.

2 If quorum is lost then all Front End services will be shutdown to protect the cluster from data loss.

To identify the Quorum voters a given servers Windows Fabric identifies check C:\ProgramData\Windows Fabric\Settings.xml. This file is updated when the Windows Fabric Host service starts up, thus any changes to pool members requires a service restart on all Front Ends in the pool, one at a time. The To-Do list in topology builder lets you know that after publishing the topology. If this isn't completed your servers will have a skewed view of the topology which could cause issues.  Depending on the number of servers within the pool, this will dictate which voters are required.

  1. Even # Server Pool (Figure 1) – each server will have 1 vote and the SQL instance will have 1 vote. The SQL server will act as a tiebreaker.
  2. Odd #Server Pool (Figure 2) – each server will have 1 vote.

Figure 1: Even number of servers in Front End Pool

Figure 2: Odd number of servers in Front End pool

 

Routing Groups

When users are provisioned for Lync they are assigned to Routing groups. Each routing group will contain 3 replicas: primary, secondary, and backup secondary which together are called a replica set. User services for all users will be handled by their primary replica. All users in the same routing group will be homed on the primary replica. If the primary replica fails, all users in that Routing Group will failover to the same secondary replica.

Routing groups will get rebalanced when Front End servers are added or removed from the pool. If the primary replica goes down another will be promoted to primary. Depending on both user count and how the primary became unavailable will dictate when the Windows Fabric rebuilds another replica. The replica knows the pools user count and if it decides creating another replica is too expensive of an operation it might wait 15-30 minutes. Also, if the primary server was taken down gracefully for patching (stop-cswindowsservice –graceful), Windows Fabric might wait longer than if the server crashed or was just powered off. We don’t have any control over how Windows Fabric handles these operations.

In regards to SBA and SBS as it pertains to routing groups each unit gets its OWN unique routing group (by design). Those would be in addition to the routing created for users homed on that pool. This means that if you have 3000 users on an SBS all of those users will have the same primary, secondary, and backup secondary replicas. The reason this is important is because this needs to be factored into a design when building out a highly available infrastructure that can support all Lync services.

 

Quorum Loss Recovery

If we lose all replicas within a routing group before Windows Fabric has a chance to rebuild more the replica set will go into quorum loss. When that routing group is in this type of quorum loss users will NOT be able to login. If 2 replicas are lost (1 remaining online) in a short period of time we will also go into quorum loss but users will be logged in with limited functionality mode (no contact list or presence but can still IM). The routing group will stay in that state until either the servers come back online or we manually run Reset-CsPoolRegistrarState with the QuorumLossRecovery option.  Remember these scenarios will not affect all users but only the users that were a part of that routing group.

You can verify the routing groups are functioning correctly by running the Get-CsPoolFabricState command (Figure 3). If there are routing groups in QuorumLoss you will see services that reference "no primaries" (Figure 4).

 

Figure 3: Get-CsPoolFabricState healthy

Figure 4: Get-CsPoolFabricState with routing groups in QuorumLoss

In order to recover from quorum loss you will need to run the Reset-CsPoolRegistrarState cmdlet with a –ResetType of QuorumLossRecovery (Figure 5).

Figure 5: Reset-CsPoolRegistrarState QuorumLossRecovery

The Reset-CsPoolRegistrarState cmdlet has multiple parameters which can be used for various changes to the cluster.

Reset-CsPoolRegistrarState –PoolFqdn <pool name fqdn> –ResetType {FullReset | QuorumLossRecovery | ServiceReset | MachineStateRemoved}

  • FullReset1 - cluster changes, any modification either bringing a pool into or out of 1-2 server pool, and changes to upgrade domains. In addition, this parameter rebuilds the local Lync Server databases. This type of reset can be potentially long and resource-intensive.
  • QuorumLossRecovery - Reloads user data from the backup store for any routing groups currently in quorum loss. (A quorum loss occurs when neither a database nor its replicas are available.) Data not yet written to the database could be lost when you do this type of reset.
  • ServiceReset1 - The RtcSrv and fabricHostSvc services are stopped and restarted. A service reset will be performed if the ResetType is not specified.
  • MachineStateRemoved - Removes the specified server from the pool. This type of reset should be used only when the server in question (or its databases) have been permanently lost.

1 Both the FullReset and ServiceReset will cause services to restart which will cause outages. This is why it's not a good idea to add front end servers to the pool during your production hours. When you add server it states you must restart front end services on all front ends in the pool.

To verify which replicas a user's data resides on we can run the Get-CsUserPoolInfo cmdlet (Figure 3). This is a useful command that will show us a ton of useful info about our registrars. If there aren't any servers listed then the users routing group might be in quorum loss.

Figure 6: Get-CsUserPoolInfo cmdlet

 

Upgrade Domains

Upgrade Domains are logical groupings of servers on which software maintenance such as upgrades and updates are performed at the same time. If you don’t adhere to upgrade domain guidelines for server patching, you will impact routing groups and potentially the pool itself. A routing group will never have two of its three servers in a single upgrade domain.

Depending on how many Front End servers are in a pool when it's created will dictate how many servers will be in each upgrade domain. The algorithm used to generate the upgrade domains is implemented by Topology Builder and can be viewed in the tbxml file after creating the topology. It is not supported to manually edit the tbxml file and change the upgrade domains. In general creating more servers in the pool will result in more servers per upgrade domain. This why it is critical to plan out your front ends in the design phase instead of adding servers here and there. If servers have to be added due to growth then there is nothing that can be done to decrease the amount of upgrade domains windows fabric creates.

So how do we monitor these upgrade domains and incorporate them into our patch process? We can use the Get-CsPoolUpgradeReadinessState cmdlet to display this data (Figure 4). There are a couple different upgrade domain states we might see which are listed below and how to handle each one.

  1. Busy – wait 10 minutes and check again. The fabric might still be balancing services which can cause this to read busy
  2. Busy 3x or InsufficientActiveFrontEnds1– there might be an insufficient amount of active front end servers in the pool or something is wrong with pool\upgrade domain. If this is the case check out Event Viewer to see what issues are occurring on these servers.
  3. Ready – Drain (Stop-CsWindowsService), Patch, and then reboot the server

After you have rebooted the server, just wait a couple of minutes before running the Get-CsPoolUpgradeReadinessState cmdlet again to find out if the pool is ready to take another upgrade domain down. Remember you should only patch ONE upgrade domain at a time.

1 If your front end pool only has 1 servers you will always get InsufficientActiveFrontEnds. In this case you know there will be an outage so you can just patch and restart that server.

 

Figure 7: Get-CsPoolUpgradeReadinessState with expanded properties

 

Best Practices "Amount of Servers in Pool"

Our recommendation is to build pools with a minimum of 3 servers. Although we support 2 server pools there are considerations with this configuration which everyone should be aware of.

  • The first issue is centered on how servers are rebooted and patched. If you following our best practice recommendation (http://technet.microsoft.com/en-us/library/gg412996.aspx) for patching it states both front end servers should be rebooted at the same time. If at any time both servers are taken offline at the same time you must bring them up in reverse order. In other works if you reboot server 1 then server 2 you must bring server 2 back up FIRST. The reason for this is because server 2 would be the last node to write quorum info to the backend and the cluster would not make quorum unless he was started first. If reverse order is not possible, use the Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter we discussed above.
  • Also keep in mind the fact that you will only have 2 copies of your data (Primary and Secondary).

 
 Troubleshooting Service Startup

  • If a routing group is down, RtCSrv will not start. (32169 - Server startup is being delayed because fabric pool manager is initializing). Use Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter
  • If a pool has gone down and stopped services, it requires 85% of servers to start back up.

 

SQL Queries

List Routings Group Names, Users in those groups, and the primary server for each routing group ordered by home server (Figure 5). This should be run against a front end server rtclocal sql instance dbo.Resource table.

Select FrontEnd.Fqdn, RoutingGroupAssignment.RoutingGroupName, Resource.UserAtHost

From ResourceDirectory

INNER JOIN Resource

ON ResourceDirectory.ResourceId=Resource.ResourceId

INNER JOIN RoutingGroupAssignment

ON ResourceDirectory.RoutingGroupId=RoutingGroupAssignment.RoutingGroupId

INNER JOIN FrontEnd

ON RoutingGroupAssignment.FrontEndId=FrontEnd.FrontEndId

ORDER BY FrontEnd.Fqdn

 

Figure 8: SQL query to list routing groups, primary server, and users

Thanks to both Bryan Nyce and Daniel Hernandez for helping document this popular technical information request. Also thanks to Tommy Mhire for providing some of this excellent material.

Read the original blog entry...

More Stories By Richard Schwendiman

My name is Richard Schwendiman and I am currently working for Microsoft as a (PFE) Premier Field Engineer specializing in both Exchange and Lync. I have been working as an IT Consultant for 13+ years focusing on a wide array of Infrastructure technologies. These technologies include Messaging, UC, Networking, Platforms, Active Directory, Virtualization, etc... I am currently certified as an MCSE (Microsoft Certified Systems Engineer), MCSE Messaging 2013, MCSE Communications 2013, MCSA 2012, MCITP Enterprise Messaging, MCTS-Lync, CCNA (Cisco Certified Network Associate), Commvault, CCNP (Cisco Certified Network Professional), and JNCIA-ER (Juniper Enterprise Routing). I am hoping that through this blog I can bring knowledge from the field and keep everyone informed about our ever changing Industry. Please feel free to email me any questions, comments, or concerns pertaining to this blog or any technology related things. Thanks and look forward to providing some good content. http://blogs.technet.com/b/rischwen/

@CloudExpo Stories
"Storpool does only block-level storage so we do one thing extremely well. The growth in data is what drives the move to software-defined technologies in general and software-defined storage," explained Boyan Ivanov, CEO and co-founder at StorPool, in this SYS-CON.tv interview at 16th Cloud Expo, held June 9-11, 2015, at the Javits Center in New York City.
A strange thing is happening along the way to the Internet of Things, namely far too many devices to work with and manage. It has become clear that we'll need much higher efficiency user experiences that can allow us to more easily and scalably work with the thousands of devices that will soon be in each of our lives. Enter the conversational interface revolution, combining bots we can literally talk with, gesture to, and even direct with our thoughts, with embedded artificial intelligence, whic...
As DevOps methodologies expand their reach across the enterprise, organizations face the daunting challenge of adapting related cloud strategies to ensure optimal alignment, from managing complexity to ensuring proper governance. How can culture, automation, legacy apps and even budget be reexamined to enable this ongoing shift within the modern software factory? In her Day 2 Keynote at @DevOpsSummit at 21st Cloud Expo, Aruna Ravichandran, VP, DevOps Solutions Marketing, CA Technologies, was jo...
As Marc Andreessen says software is eating the world. Everything is rapidly moving toward being software-defined – from our phones and cars through our washing machines to the datacenter. However, there are larger challenges when implementing software defined on a larger scale - when building software defined infrastructure. In his session at 16th Cloud Expo, Boyan Ivanov, CEO of StorPool, provided some practical insights on what, how and why when implementing "software-defined" in the datacent...
Blockchain. A day doesn’t seem to go by without seeing articles and discussions about the technology. According to PwC executive Seamus Cushley, approximately $1.4B has been invested in blockchain just last year. In Gartner’s recent hype cycle for emerging technologies, blockchain is approaching the peak. It is considered by Gartner as one of the ‘Key platform-enabling technologies to track.’ While there is a lot of ‘hype vs reality’ discussions going on, there is no arguing that blockchain is b...
Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across business networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost as well as advance trade. Are you curious about how Blockchain is built for business? In her session at 21st Cloud Expo, René Bostic, Technical VP of the IBM Cloud Unit in North America, discussed the b...
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know that not all workloads are suitable for cloud. You know that you want the kind of ease of use and scalability that you get with public cloud, but your applications are architected in a way that makes the public cloud a non-starter. You’re looking at private cloud solutions based on hyperconverged infrastructure, but you’re concerned with the limits inherent in those technologies.
Is advanced scheduling in Kubernetes achievable?Yes, however, how do you properly accommodate every real-life scenario that a Kubernetes user might encounter? How do you leverage advanced scheduling techniques to shape and describe each scenario in easy-to-use rules and configurations? In his session at @DevOpsSummit at 21st Cloud Expo, Oleg Chunikhin, CTO at Kublr, answered these questions and demonstrated techniques for implementing advanced scheduling. For example, using spot instances and co...
The use of containers by developers -- and now increasingly IT operators -- has grown from infatuation to deep and abiding love. But as with any long-term affair, the honeymoon soon leads to needing to live well together ... and maybe even getting some relationship help along the way. And so it goes with container orchestration and automation solutions, which are rapidly emerging as the means to maintain the bliss between rapid container adoption and broad container use among multiple cloud host...
The cloud era has reached the stage where it is no longer a question of whether a company should migrate, but when. Enterprises have embraced the outsourcing of where their various applications are stored and who manages them, saving significant investment along the way. Plus, the cloud has become a defining competitive edge. Companies that fail to successfully adapt risk failure. The media, of course, continues to extol the virtues of the cloud, including how easy it is to get there. Migrating...
Imagine if you will, a retail floor so densely packed with sensors that they can pick up the movements of insects scurrying across a store aisle. Or a component of a piece of factory equipment so well-instrumented that its digital twin provides resolution down to the micrometer.
The need for greater agility and scalability necessitated the digital transformation in the form of following equation: monolithic to microservices to serverless architecture (FaaS). To keep up with the cut-throat competition, the organisations need to update their technology stack to make software development their differentiating factor. Thus microservices architecture emerged as a potential method to provide development teams with greater flexibility and other advantages, such as the abili...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life settle...
Product connectivity goes hand and hand these days with increased use of personal data. New IoT devices are becoming more personalized than ever before. In his session at 22nd Cloud Expo | DXWorld Expo, Nicolas Fierro, CEO of MIMIR Blockchain Solutions, will discuss how in order to protect your data and privacy, IoT applications need to embrace Blockchain technology for a new level of product security never before seen - or needed.
Leading companies, from the Global Fortune 500 to the smallest companies, are adopting hybrid cloud as the path to business advantage. Hybrid cloud depends on cloud services and on-premises infrastructure working in unison. Successful implementations require new levels of data mobility, enabled by an automated and seamless flow across on-premises and cloud resources. In his general session at 21st Cloud Expo, Greg Tevis, an IBM Storage Software Technical Strategist and Customer Solution Architec...
Nordstrom is transforming the way that they do business and the cloud is the key to enabling speed and hyper personalized customer experiences. In his session at 21st Cloud Expo, Ken Schow, VP of Engineering at Nordstrom, discussed some of the key learnings and common pitfalls of large enterprises moving to the cloud. This includes strategies around choosing a cloud provider(s), architecture, and lessons learned. In addition, he covered some of the best practices for structured team migration an...
In his general session at 21st Cloud Expo, Greg Dumas, Calligo’s Vice President and G.M. of US operations, discussed the new Global Data Protection Regulation and how Calligo can help business stay compliant in digitally globalized world. Greg Dumas is Calligo's Vice President and G.M. of US operations. Calligo is an established service provider that provides an innovative platform for trusted cloud solutions. Calligo’s customers are typically most concerned about GDPR compliance, application p...
Coca-Cola’s Google powered digital signage system lays the groundwork for a more valuable connection between Coke and its customers. Digital signs pair software with high-resolution displays so that a message can be changed instantly based on what the operator wants to communicate or sell. In their Day 3 Keynote at 21st Cloud Expo, Greg Chambers, Global Group Director, Digital Innovation, Coca-Cola, and Vidya Nagarajan, a Senior Product Manager at Google, discussed how from store operations and ...
In his session at 21st Cloud Expo, Raju Shreewastava, founder of Big Data Trunk, provided a fun and simple way to introduce Machine Leaning to anyone and everyone. He solved a machine learning problem and demonstrated an easy way to be able to do machine learning without even coding. Raju Shreewastava is the founder of Big Data Trunk (www.BigDataTrunk.com), a Big Data Training and consulting firm with offices in the United States. He previously led the data warehouse/business intelligence and B...
"IBM is really all in on blockchain. We take a look at sort of the history of blockchain ledger technologies. It started out with bitcoin, Ethereum, and IBM evaluated these particular blockchain technologies and found they were anonymous and permissionless and that many companies were looking for permissioned blockchain," stated René Bostic, Technical VP of the IBM Cloud Unit in North America, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Conventi...