Welcome!

IoT User Interface Authors: Elizabeth White, Liz McMillan, AppDynamics Blog, Yakov Fain, Jason Bloomberg

Blog Feed Post

Lync 2013 High Availability deep dive: Architecture

With the release of Lync 2013 the story of high availability has changed significantly. Since these changes contain some new technologies I think it would be best to break this article out into multiple parts. For this first part, we will discuss the Architecture and that will be followed by the end user experience. There is a lot to discuss so let's get started.

Windows Fabric

One of the major changes is how Lync provides highly available services with a pool. In Lync 2010 this responsibility was handled by Cluster Manager. With Lync 2013 we now leverage the Windows Fabric to provide high availability to these services. The Windows Fabric services allows Lync to increase our scale for both On-Premise and Online deployments. Windows Fabric will handle the following:

  • Election of Primary, Secondary, and Backup Secondary replica
  • Failover Management including Fast Failover with full service
  • Replication between primary and secondary replicas
  • Manages and Monitors which servers are functional and participating in pool operations
  • Services for MCU Factory, Conference Directory, Routing Group, LYSS
  • Load Balancing within a pool

Data Replication within the Pool

  1. Persistent User Data – synchronous replication to all replicas
    1. Presence, Contacts and Groups, and User Voice settings
    2. Minimal Conference data including the Active Conference Roster and MCUs
    3. Transient User Data – not replicated across all replicas
      1. Presence changes due to user activity including Calendar, Inactivity, and Phone Calls.

The fabric will also lazily write this data to the backend end databases. Doing this lazily means when there are free cycles we will use them to replicate this data instead of impacting the server when its busy doing other replication to front ends. The backend data can be used for both rehydration of front end servers (restarts) and disaster recovery (pool failovers and failbacks).

The following ports must be opened between the Lync front end servers within a pool for the fabric replication process to be successful.

  • Federation (TCP/5090) - Windows Fabric ring communication.
  • Lease Agent (TCP/5091)
  • Client Connection (TCP/5092) - These are not user clients but Lync Services that register to Windows Fabric.
  • Runtime Service (TCP/5093)
  • Replication (TCP/5094)

 

Quorum

In order to function the Windows Fabric needs to achieve a status of Quorum. This means a certain number of servers (Voters) must be able to communicate with each other to ensure that the pool is capable of maintaining consistency (Quorum). There are 2 different pool states in which we must maintain\achieve quorum which differ in how many voters must be available.

  1. Cold Pool Startup\Quorum Lost – when a pool is first starting or after quorum has been lost, that pool must have roughly 85% of its servers available to achieve quorum1.
  2. Operating Pool – after a pool has been started we need at least 50% of the servers available to maintain quorum2.

1 This means that when defining a pool within topology builder only add the servers that will be deployed at that time, not in the future. If you put 6 servers in a pool within topology builder and only deploy 3, we will never achieve quorum because we don’t have approximitley 85% of the servers operational. Thus, the front end services will never start.

2 If quorum is lost then all Front End services will be shutdown to protect the cluster from data loss.

To identify the Quorum voters a given servers Windows Fabric identifies check C:\ProgramData\Windows Fabric\Settings.xml. This file is updated when the Windows Fabric Host service starts up, thus any changes to pool members requires a service restart on all Front Ends in the pool, one at a time. The To-Do list in topology builder lets you know that after publishing the topology. If this isn't completed your servers will have a skewed view of the topology which could cause issues.  Depending on the number of servers within the pool, this will dictate which voters are required.

  1. Even # Server Pool (Figure 1) – each server will have 1 vote and the SQL instance will have 1 vote. The SQL server will act as a tiebreaker.
  2. Odd #Server Pool (Figure 2) – each server will have 1 vote.

Figure 1: Even number of servers in Front End Pool

Figure 2: Odd number of servers in Front End pool

 

Routing Groups

When users are provisioned for Lync they are assigned to Routing groups. Each routing group will contain 3 replicas: primary, secondary, and backup secondary which together are called a replica set. User services for all users will be handled by their primary replica. All users in the same routing group will be homed on the primary replica. If the primary replica fails, all users in that Routing Group will failover to the same secondary replica.

Routing groups will get rebalanced when Front End servers are added or removed from the pool. If the primary replica goes down another will be promoted to primary. Depending on both user count and how the primary became unavailable will dictate when the Windows Fabric rebuilds another replica. The replica knows the pools user count and if it decides creating another replica is too expensive of an operation it might wait 15-30 minutes. Also, if the primary server was taken down gracefully for patching (stop-cswindowsservice –graceful), Windows Fabric might wait longer than if the server crashed or was just powered off. We don’t have any control over how Windows Fabric handles these operations.

In regards to SBA and SBS as it pertains to routing groups each unit gets its OWN unique routing group (by design). Those would be in addition to the routing created for users homed on that pool. This means that if you have 3000 users on an SBS all of those users will have the same primary, secondary, and backup secondary replicas. The reason this is important is because this needs to be factored into a design when building out a highly available infrastructure that can support all Lync services.

 

Quorum Loss Recovery

If we lose all replicas within a routing group before Windows Fabric has a chance to rebuild more the replica set will go into quorum loss. When that routing group is in this type of quorum loss users will NOT be able to login. If 2 replicas are lost (1 remaining online) in a short period of time we will also go into quorum loss but users will be logged in with limited functionality mode (no contact list or presence but can still IM). The routing group will stay in that state until either the servers come back online or we manually run Reset-CsPoolRegistrarState with the QuorumLossRecovery option.  Remember these scenarios will not affect all users but only the users that were a part of that routing group.

You can verify the routing groups are functioning correctly by running the Get-CsPoolFabricState command (Figure 3). If there are routing groups in QuorumLoss you will see services that reference "no primaries" (Figure 4).

 

Figure 3: Get-CsPoolFabricState healthy

Figure 4: Get-CsPoolFabricState with routing groups in QuorumLoss

In order to recover from quorum loss you will need to run the Reset-CsPoolRegistrarState cmdlet with a –ResetType of QuorumLossRecovery (Figure 5).

Figure 5: Reset-CsPoolRegistrarState QuorumLossRecovery

The Reset-CsPoolRegistrarState cmdlet has multiple parameters which can be used for various changes to the cluster.

Reset-CsPoolRegistrarState –PoolFqdn <pool name fqdn> –ResetType {FullReset | QuorumLossRecovery | ServiceReset | MachineStateRemoved}

  • FullReset1 - cluster changes, any modification either bringing a pool into or out of 1-2 server pool, and changes to upgrade domains. In addition, this parameter rebuilds the local Lync Server databases. This type of reset can be potentially long and resource-intensive.
  • QuorumLossRecovery - Reloads user data from the backup store for any routing groups currently in quorum loss. (A quorum loss occurs when neither a database nor its replicas are available.) Data not yet written to the database could be lost when you do this type of reset.
  • ServiceReset1 - The RtcSrv and fabricHostSvc services are stopped and restarted. A service reset will be performed if the ResetType is not specified.
  • MachineStateRemoved - Removes the specified server from the pool. This type of reset should be used only when the server in question (or its databases) have been permanently lost.

1 Both the FullReset and ServiceReset will cause services to restart which will cause outages. This is why it's not a good idea to add front end servers to the pool during your production hours. When you add server it states you must restart front end services on all front ends in the pool.

To verify which replicas a user's data resides on we can run the Get-CsUserPoolInfo cmdlet (Figure 3). This is a useful command that will show us a ton of useful info about our registrars. If there aren't any servers listed then the users routing group might be in quorum loss.

Figure 6: Get-CsUserPoolInfo cmdlet

 

Upgrade Domains

Upgrade Domains are logical groupings of servers on which software maintenance such as upgrades and updates are performed at the same time. If you don’t adhere to upgrade domain guidelines for server patching, you will impact routing groups and potentially the pool itself. A routing group will never have two of its three servers in a single upgrade domain.

Depending on how many Front End servers are in a pool when it's created will dictate how many servers will be in each upgrade domain. The algorithm used to generate the upgrade domains is implemented by Topology Builder and can be viewed in the tbxml file after creating the topology. It is not supported to manually edit the tbxml file and change the upgrade domains. In general creating more servers in the pool will result in more servers per upgrade domain. This why it is critical to plan out your front ends in the design phase instead of adding servers here and there. If servers have to be added due to growth then there is nothing that can be done to decrease the amount of upgrade domains windows fabric creates.

So how do we monitor these upgrade domains and incorporate them into our patch process? We can use the Get-CsPoolUpgradeReadinessState cmdlet to display this data (Figure 4). There are a couple different upgrade domain states we might see which are listed below and how to handle each one.

  1. Busy – wait 10 minutes and check again. The fabric might still be balancing services which can cause this to read busy
  2. Busy 3x or InsufficientActiveFrontEnds1– there might be an insufficient amount of active front end servers in the pool or something is wrong with pool\upgrade domain. If this is the case check out Event Viewer to see what issues are occurring on these servers.
  3. Ready – Drain (Stop-CsWindowsService), Patch, and then reboot the server

After you have rebooted the server, just wait a couple of minutes before running the Get-CsPoolUpgradeReadinessState cmdlet again to find out if the pool is ready to take another upgrade domain down. Remember you should only patch ONE upgrade domain at a time.

1 If your front end pool only has 1 servers you will always get InsufficientActiveFrontEnds. In this case you know there will be an outage so you can just patch and restart that server.

 

Figure 7: Get-CsPoolUpgradeReadinessState with expanded properties

 

Best Practices "Amount of Servers in Pool"

Our recommendation is to build pools with a minimum of 3 servers. Although we support 2 server pools there are considerations with this configuration which everyone should be aware of.

  • The first issue is centered on how servers are rebooted and patched. If you following our best practice recommendation (http://technet.microsoft.com/en-us/library/gg412996.aspx) for patching it states both front end servers should be rebooted at the same time. If at any time both servers are taken offline at the same time you must bring them up in reverse order. In other works if you reboot server 1 then server 2 you must bring server 2 back up FIRST. The reason for this is because server 2 would be the last node to write quorum info to the backend and the cluster would not make quorum unless he was started first. If reverse order is not possible, use the Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter we discussed above.
  • Also keep in mind the fact that you will only have 2 copies of your data (Primary and Secondary).

 
 Troubleshooting Service Startup

  • If a routing group is down, RtCSrv will not start. (32169 - Server startup is being delayed because fabric pool manager is initializing). Use Reset-CsPoolRegistrarState cmdlet with the QuorumLossRecovery parameter
  • If a pool has gone down and stopped services, it requires 85% of servers to start back up.

 

SQL Queries

List Routings Group Names, Users in those groups, and the primary server for each routing group ordered by home server (Figure 5). This should be run against a front end server rtclocal sql instance dbo.Resource table.

Select FrontEnd.Fqdn, RoutingGroupAssignment.RoutingGroupName, Resource.UserAtHost

From ResourceDirectory

INNER JOIN Resource

ON ResourceDirectory.ResourceId=Resource.ResourceId

INNER JOIN RoutingGroupAssignment

ON ResourceDirectory.RoutingGroupId=RoutingGroupAssignment.RoutingGroupId

INNER JOIN FrontEnd

ON RoutingGroupAssignment.FrontEndId=FrontEnd.FrontEndId

ORDER BY FrontEnd.Fqdn

 

Figure 8: SQL query to list routing groups, primary server, and users

Thanks to both Bryan Nyce and Daniel Hernandez for helping document this popular technical information request. Also thanks to Tommy Mhire for providing some of this excellent material.

Read the original blog entry...

More Stories By Richard Schwendiman

My name is Richard Schwendiman and I am currently working for Microsoft as a (PFE) Premier Field Engineer specializing in both Exchange and Lync. I have been working as an IT Consultant for 13+ years focusing on a wide array of Infrastructure technologies. These technologies include Messaging, UC, Networking, Platforms, Active Directory, Virtualization, etc... I am currently certified as an MCSE (Microsoft Certified Systems Engineer), MCSE Messaging 2013, MCSE Communications 2013, MCSA 2012, MCITP Enterprise Messaging, MCTS-Lync, CCNA (Cisco Certified Network Associate), Commvault, CCNP (Cisco Certified Network Professional), and JNCIA-ER (Juniper Enterprise Routing). I am hoping that through this blog I can bring knowledge from the field and keep everyone informed about our ever changing Industry. Please feel free to email me any questions, comments, or concerns pertaining to this blog or any technology related things. Thanks and look forward to providing some good content. http://blogs.technet.com/b/rischwen/

@CloudExpo Stories
As machines are increasingly connected to the internet, it’s becoming easier to discover the numerous ways Industrial IoT (IIoT) is helping to shape the business world. This is exactly why we have decided to take a closer look at this pervasive movement and to examine the desire to connect more things! Now if you need a refresher on IIoT and how it is changing the world, take a moment and listen to Greg Gorbach with ARC Advisory Group. Gorbach believes, "IIoT will significantly change the worl...
Enterprise networks are complex. Moreover, they were designed and deployed to meet a specific set of business requirements at a specific point in time. But, the adoption of cloud services, new business applications and intensifying security policies, among other factors, require IT organizations to continuously deploy configuration changes. Therefore, enterprises are looking for better ways to automate the management of their networks while still leveraging existing capabilities, optimizing perf...
Designing IoT applications is complex, but deploying them in a scalable fashion is even more complex. A scalable, API first IaaS cloud is a good start, but in order to understand the various components specific to deploying IoT applications, one needs to understand the architecture of these applications and figure out how to scale these components independently. In his session at @ThingsExpo, Nara Rajagopalan is CEO of Accelerite, will discuss the fundamental architecture of IoT applications, ...
As cloud and storage projections continue to rise, the number of organizations moving to the cloud is escalating and it is clear cloud storage is here to stay. However, is it secure? Data is the lifeblood for government entities, countries, cloud service providers and enterprises alike and losing or exposing that data can have disastrous results. There are new concepts for data storage on the horizon that will deliver secure solutions for storing and moving sensitive data around the world. ...
In his session at 18th Cloud Expo, Bruce Swann, Senior Product Marketing Manager at Adobe, will discuss how the Adobe Marketing Cloud can help marketers embrace opportunities for personalized, relevant and real-time customer engagement across offline (direct mail, point of sale, call center) and digital (email, website, SMS, mobile apps, social networks, connected objects). Bruce Swann has more than 15 years of experience working with digital marketing disciplines like web analytics, social med...
Many banks and financial institutions are experimenting with containers in development environments, but when will they move into production? Containers are seen as the key to achieving the ultimate in information technology flexibility and agility. Containers work on both public and private clouds, and make it easy to build and deploy applications. The challenge for regulated industries is the cost and complexity of container security compliance. VM security compliance is already challenging, ...
The essence of data analysis involves setting up data pipelines that consist of several operations that are chained together – starting from data collection, data quality checks, data integration, data analysis and data visualization (including the setting up of interaction paths in that visualization). In our opinion, the challenges stem from the technology diversity at each stage of the data pipeline as well as the lack of process around the analysis.
What a difference a year makes. Organizations aren’t just talking about IoT possibilities, it is now baked into their core business strategy. With IoT, billions of devices generating data from different companies on different networks around the globe need to interact. From efficiency to better customer insights to completely new business models, IoT will turn traditional business models upside down. In the new customer-centric age, the key to success is delivering critical services and apps wit...
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York and Silicon Valley. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place Nov 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with 17th Cloud Expo and will feature technical sessions from a rock star conference faculty ...
The 19th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit y...
Internet of @ThingsExpo, taking place November 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 19th International Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world and ThingsExpo New York Call for Papers is now open.
There are several IoTs: the Industrial Internet, Consumer Wearables, Wearables and Healthcare, Supply Chains, and the movement toward Smart Grids, Cities, Regions, and Nations. There are competing communications standards every step of the way, a bewildering array of sensors and devices, and an entire world of competing data analytics platforms. To some this appears to be chaos. In this power panel at @ThingsExpo, moderated by Conference Chair Roger Strukhoff, panelists will discuss the vast to...
SYS-CON Events announced today that Enzu, a leading provider of cloud hosting solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Enzu’s mission is to be the leading provider of enterprise cloud solutions worldwide. Enzu enables online businesses to use its IT infrastructure to their competitive advantage. By offering a suite of proven hosting and management services, Enzu wants companies to foc...
SYS-CON Events announced today the How to Create Angular 2 Clients for the Cloud Workshop, being held June 7, 2016, in conjunction with 18th Cloud Expo | @ThingsExpo, at the Javits Center in New York, NY. Angular 2 is a complete re-write of the popular framework AngularJS. Programming in Angular 2 is greatly simplified. Now it’s a component-based well-performing framework. The immersive one-day workshop led by Yakov Fain, a Java Champion and a co-founder of the IT consultancy Farata Systems and...
Customer experience has become a competitive differentiator for companies, and it’s imperative that brands seamlessly connect the customer journey across all platforms. With the continued explosion of IoT, join us for a look at how to build a winning digital foundation in the connected era – today and in the future. In his session at @ThingsExpo, Chris Nguyen, Group Product Marketing Manager at Adobe, will discuss how to successfully leverage mobile, rapidly deploy content, capture real-time d...
IoT generates lots of temporal data. But how do you unlock its value? How do you coordinate the diverse moving parts that must come together when developing your IoT product? What are the key challenges addressed by Data as a Service? How does cloud computing underlie and connect the notions of Digital and DevOps What is the impact of the API economy? What is the business imperative for Cognitive Computing? Get all these questions and hundreds more like them answered at the 18th Cloud Expo...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, will provide an overview of the evolution of the Internet and the Database and the future of their combination – the Blockchain. Andrew Keys is Co-Founder of ConsenSys Enterprise. He comes to ConsenSys Enterprise with capital markets, technology and entrepreneurial experience. Previously, he worked for UBS investment bank in equities analysis. Later, he was responsible for the creation and distribution of life ...
@DevOpsSummit taking place June 7-9, 2016 at Javits Center, New York City, and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
The cloud market growth today is largely in public clouds. While there is a lot of spend in IT departments in virtualization, these aren’t yet translating into a true “cloud” experience within the enterprise. What is stopping the growth of the “private cloud” market? In his general session at 18th Cloud Expo, Nara Rajagopalan, CEO of Accelerite, will explore the challenges in deploying, managing, and getting adoption for a private cloud within an enterprise. What are the key differences betwee...
SYS-CON Events announced today that ContentMX, the marketing technology and services company with a singular mission to increase engagement and drive more conversations for enterprise, channel and SMB technology marketers, has been named “Sponsor & Exhibitor Lounge Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, New York. “CloudExpo is a great opportunity to start a conversation with new prospects, but what happens after the...