Colleagues,
Sharing this with the community in case someone has answers or can confirm my findings.
It is a long case description, but this is because it has all the technical details that would make sense to experts, and improve chances of resolution.
---------------
I got a 4G LTE Cellspot V1 a couple of weeks ago and have been trying to make it work ever since. It just won't go past blinking green+orange lights; sometimes the orange light goes solid for a while, but then it goes back to blinking for most of the time. This points to problems in communications between the device and TM servers.
Connection diagram:
"4G LTE Cellspot V1"
-> "Mikrotik RB2011" (primary router / firewall / NAT, with public IP; UDP ports 500, 4500, 123 verified open through NAT)
-> "Netgear GS108T" (managed switch, with mirror port for packet capture)
-> "Technicolor TC8717T" (cable modem in "dumb bridge mode", i.e. no additional routing / filtering / NAT'ing)
-> [100Mbps coax cable link]
-> Spectrum (TWC) ISP
-> T-Mobile servers
Finally, this weekend I got some time to dig into this further, looking at packet logs on the Mikrotik router itself and running a Wireshark packet capture between the router and the modem (notebook on the mirrored switch port, getting a copy of all the traffic from the router port).
Looking at the packet captures reveals an interesting picture of the attempts to establish an IPSEC / NAT-T tunnel:
1) The initial IKE_SA_INIT request-response sequence via UDP 500 works fine every time, no issues here. I can see packets going correctly in both directions.
2) From the server's IKE_SA_INIT response, the Cellspot device understands that NAT is involved, so it correctly switches to NAT-T, i.e. encapsulating all further ISAKMP and ESP communications in UDP 4500.
3) The next stage is IKE_AUTH, where the Cellspot client is supposed to send its identity and certificate to authenticate with the server, and get the server's identity and certificate in response, to establish mutual trust.
This is where the problems start.
Likely due to the size of the certificates, the next ISAKMP messages sent over UDP 4500 are larger than the IP MTU, which means that these UDP packets have to be fragmented. I see packets from 1900 to 4000+ bytes (for whatever reason, the size of these auth packets is not consistent and keeps changing from attempt to attempt, and there's no way to check why since the content is encrypted).
Fragmentation is not a problem for the Mikrotik router / firewall / NAT (which knows how to reassemble / filter / NAT / fragment), so I see these larger packets flowing reassembled through the NAT, and then I see them again properly fragmented in the Wireshark packet capture going from the router to the cable modem.
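To put numbers on the fragmentation described above, here is a minimal Python sketch of how one UDP datagram is split into IPv4 fragments. It assumes a 1500-byte MTU, a 20-byte IPv4 header without options, and that the 1900 to 4000+ byte sizes I see are UDP payload lengths; the function name ipv4_fragments is made up for illustration.

```python
IP_HEADER = 20    # assumption: IPv4 header without options
UDP_HEADER = 8    # the UDP header travels only in the first fragment


def ipv4_fragments(udp_payload_len, mtu=1500):
    """Return the IP total length of each fragment of one UDP datagram."""
    remaining = UDP_HEADER + udp_payload_len   # bytes carried as IP payload
    max_data = mtu - IP_HEADER                 # most data one packet can carry
    aligned = max_data // 8 * 8                # non-final fragments carry a multiple of 8 bytes
    sizes = []
    while remaining > max_data:
        sizes.append(IP_HEADER + aligned)
        remaining -= aligned
    sizes.append(IP_HEADER + remaining)        # final (or only) packet
    return sizes


for size in (1400, 1900, 4000):
    print(size, ipv4_fragments(size))
```

Under these assumptions a 1900-byte message leaves the router as 2 fragments and a 4000-byte one as 3. If any single fragment is dropped along the way, the receiver cannot reassemble the datagram, so the whole IKE_AUTH request is lost.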
The problem is that in 99% of cases no responses come back. Either fragments are dropped somewhere within the Spectrum or T-Mobile network (which is quite possible, for example if some firewall is doing "fragmentation attack" threat protection too aggressively), or the T-Mobile servers do not like the auth requests they get from my Cellspot device. Either way, the end result is one of the following cases, none of which leads to success:
- 99% cases: there's no IKE_AUTH response from the server; Cellspot device retries IKE_AUTH request 5 times, gets no responses from server, and then gives up and does nothing; after ~2.5 minutes of silence, server inquires with INFORMATIONAL message over UDP 4500 (to check if the client is still alive), and Cellspot client promptly responds, but there is no further activity; another ~1.5 minutes later the client just starts another full cycle with UDP 500 / IKE_SA_INIT message, and the same erratic behavior is repeated all over.
- 0.5% cases: occasionally, even a larger IKE_AUTH request finally gets an IKE_AUTH response from the server, and then the whole sequence proceeds to the next stage, which is ESP encrypted data flowing over UDP 4500; but even in this case, the data flows for a few seconds and then stops; the client and server then exchange INFORMATIONAL messages (successfully) to check that both are alive, but do not resume the ESP data exchange; after a while, the whole thing restarts at the UDP 500 / IKE_SA_INIT message; (I think these are the cases when the orange light goes solid for a while);
- 0.5% cases: rarely, the initial IKE_AUTH request message is smaller than the MTU, so it is not fragmented; in such cases, the IKE_AUTH request _always_ gets a corresponding IKE_AUTH response from the server immediately, but the payload in that response is "Notify (41)" instead of the expected "Identification - Responder (36)", which strongly suggests that the server is unhappy with the request; in these cases, the Cellspot device does not retry the IKE_AUTH request, instead it restarts with the UDP 500 / IKE_SA_INIT message immediately.
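For anyone who wants to sort their own capture into the cases above without clicking through Wireshark, here is a minimal Python sketch that classifies one reassembled UDP payload. It relies on the fixed 28-byte IKEv2 header, the 4-byte all-zero non-ESP marker used on UDP 4500, and the fact that the Encrypted payload's header names the type of the first payload inside it in cleartext. The function name describe_udp_payload and the returned dictionary layout are my own invention, and the lookup tables list only the values relevant here.

```python
import struct

EXCHANGE_TYPES = {34: "IKE_SA_INIT", 35: "IKE_AUTH",
                  36: "CREATE_CHILD_SA", 37: "INFORMATIONAL"}
PAYLOAD_TYPES = {36: "IDr", 41: "Notify", 46: "Encrypted"}  # subset only
RESPONSE_FLAG = 0x20  # "R" bit in the IKEv2 header flags


def describe_udp_payload(data, port):
    """Classify one whole (already reassembled) UDP payload from port 500 or 4500."""
    if port == 4500:
        if data == b"\xff":
            return {"kind": "NAT-keepalive"}
        if data[:4] != b"\x00\x00\x00\x00":
            return {"kind": "ESP"}          # first 4 bytes are a non-zero ESP SPI
        data = data[4:]                     # strip the non-ESP marker
    if len(data) < 28:
        return {"kind": "truncated"}
    (_init_spi, _resp_spi, next_payload, _version,
     exchange, flags, message_id, length) = struct.unpack("!8s8sBBBBII", data[:28])
    info = {
        "kind": "IKE",
        "exchange": EXCHANGE_TYPES.get(exchange, str(exchange)),
        "is_response": bool(flags & RESPONSE_FLAG),
        "message_id": message_id,
        "length": length,
        "first_payload": PAYLOAD_TYPES.get(next_payload, str(next_payload)),
    }
    # The Encrypted payload's own "next payload" byte is sent in cleartext and
    # gives the type of the first payload inside the encrypted body.
    if next_payload == 46 and len(data) > 28:
        info["first_inner_payload"] = PAYLOAD_TYPES.get(data[28], str(data[28]))
    return info
```

With this, an IKE_AUTH response whose first_inner_payload is "Notify" rather than "IDr" corresponds to the third case, and a run of IKE_AUTH requests with the same message_id and no response in between corresponds to the first.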
-----------------------------
All these cases and behaviors show that there are no general connectivity issues between the Cellspot and T-Mobile servers over UDP ports 500 and 4500: packets smaller than the MTU flow properly in both directions, both endpoints understand each other, recognize that they are talking through NAT, and occasionally even get past the IPSEC authentication stages and start sending the actual ESP data.
But it looks like packets larger than MTU (>1500 bytes) get dropped in many (or most) cases somewhere within Spectrum or T-Mobile networks, and as a result the IPSEC tunnel cannot properly be established, or gets broken shortly after being established (depending on which larger packet gets dropped and at which stage). I cannot be sure that this is the case, because I can only observe traffic leaving and entering my home network, but cannot see what happens beyond that - can only speculate, based on behavior observations.
If it is indeed a UDP packet fragmentation related issue, it seems like the standard solution to avoid such UDP fragmentation issues is to enable IPSEC "IKE Fragmentation" feature on the client (in this case - Cellspot device), so that the fragmentation would happen on IPSEC layer, and resulting UDP packets would be guaranteed below MTU. But since this is a "black box" device that cannot be managed by the user (me) - I can't even check if that is doable on this device, let alone make the config change...
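One thing that can be checked from the outside: IKE_SA_INIT messages are not encrypted, and an endpoint that supports IKEv2 message fragmentation announces it there with an IKEV2_FRAGMENTATION_SUPPORTED notification (type 16430). Below is a minimal Python sketch that walks the payload chain of a raw IKE_SA_INIT message (the UDP 500 payload, which has no non-ESP marker) and lists the notification types it carries. The function names notify_types and offers_ike_fragmentation are hypothetical, and I have not run this against the Cellspot's actual packets.

```python
import struct

NOTIFY = 41                              # IKEv2 payload type "Notify"
NAT_DETECTION_SOURCE_IP = 16388
NAT_DETECTION_DESTINATION_IP = 16389
IKEV2_FRAGMENTATION_SUPPORTED = 16430


def notify_types(ike_message):
    """Return the notify message types found in an unencrypted IKEv2 message."""
    if len(ike_message) < 28:
        return []
    next_payload = ike_message[16]       # "next payload" byte of the IKE header
    offset = 28                          # first payload starts after the header
    found = []
    while next_payload != 0 and offset + 4 <= len(ike_message):
        following, _critical, length = struct.unpack_from("!BBH", ike_message, offset)
        if length < 4 or offset + length > len(ike_message):
            break                        # malformed or truncated payload
        if next_payload == NOTIFY and length >= 8:
            # Notify body: protocol ID (1), SPI size (1), message type (2), ...
            (msg_type,) = struct.unpack_from("!H", ike_message, offset + 6)
            found.append(msg_type)
        next_payload = following
        offset += length
    return found


def offers_ike_fragmentation(ike_message):
    """True if the message announces IKEV2_FRAGMENTATION_SUPPORTED."""
    return IKEV2_FRAGMENTATION_SUPPORTED in notify_types(ike_message)
```

If the Cellspot's IKE_SA_INIT request does not carry 16430, the device is not offering IKE fragmentation at all; if only the server's response lacks it, the limitation would be on the T-Mobile side. The two NAT_DETECTION types are listed because they are what both sides use to discover the NAT in step 2.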
It is also possible that the T-Mobile servers just don't accept the device identity, i.e. authentication or authorization logically fails. This would be strange though, since my T-Mobile profile page correctly lists this device under the "Coverage Device" section.
I'm hoping that real TMO technical experts sometimes read these forums, and would chime in to help. I know I should probably call TMO support hotline, but it's such an extreme torture, that it would probably be considered "cruel and unusual punishment" by any US court of law.
Any constructive ideas welcome, as well as any technical confirmations of similar behavior (from those who are capable of running and analyzing packet captures).
Dmitry